%% Commands for TeXCount %TC:macro \cite [option:text,text] %TC:macro \citep [option:text,text] %TC:macro \citet [option:text,text] %TC:envir table 0 1 %TC:envir table* 0 1 %TC:envir tabular [ignore] word %TC:envir displaymath 0 word %TC:envir math 0 word %TC:envir comment 0 0 %% %% %% The first command in your LaTeX source must be the \documentclass %% command. %% %% For submission and review of your manuscript please change the %% command to \documentclass[manuscript, screen, review]{acmart}. %% %% When submitting camera ready or to TAPS, please change the command %% to \documentclass[sigconf]{acmart} or whichever template is required %% for your publication. %% %% %\documentclass[sigconf, anonymous, review, nonacm=true]{acmart} \documentclass[sigconf]{acmart} %% %% \BibTeX command to typeset BibTeX logo in the docs \AtBeginDocument{% \providecommand\BibTeX{{% Bib\TeX}}} %% Rights management information. This information is sent to you %% when you complete the rights form. These commands have SAMPLE %% values in them; it is your responsibility as an author to replace %% the commands and values with those provided to you when you %% complete the rights form. 
\copyrightyear{2025} \acmYear{2025} \setcopyright{rightsretained} \acmConference[RAPIDO '25]{Rapid Simulation and Performance Evaluation for Design}{January 21, 2025}{Barcelona, Spain} \acmBooktitle{Rapid Simulation and Performance Evaluation for Design (RAPIDO '25), January 21, 2025, Barcelona, Spain} \acmPrice{} \acmDOI{10.1145/3721848.3721850} \acmISBN{979-8-4007-1471-9/25/01} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Document Settings %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{subcaption} \usepackage{siunitx} \usepackage{tikz} \usetikzlibrary{patterns,arrows,decorations.pathreplacing} \usetikzlibrary{arrows.meta} %\usetikzlibrary{arrows,automata} \usetikzlibrary{positioning} \usetikzlibrary{positioning,shadows,trees} \usepackage{pgfplots} \usepackage{pgfplotstable} \usepgfplotslibrary[groupplots] \usepackage{tikz-timing}[2009/12/09] \usetikztiminglibrary{overlays} %%% Timing Diagram Setup %%% %Define different DRAM commands: \tikztimingmetachar{A}{1.0D{\texttt{ACT}}} \tikztimingmetachar{P}{1.0D{\texttt{PRE}}} \tikztimingmetachar{X}{1.0D{\texttt{DES}}} \tikztimingmetachar{R}{1.0D{\texttt{RDA}}} \tikztimingmetachar{W}{1.0D{\texttt{WR}}} \tikztimingmetachar{O}{1.0D{\texttt{NOP}}} \newcommand{\timemeasure}[4] { \draw [red,semithick] ($ (#1) - (-0.1,0) $) -- ($ (#1) - (-0.1,#3) -(0,1) $); \draw [red,semithick] ($ (#2) - (-0.1,0) $) -- ($ (#2) - (-0.1,#3) -(0,1) $); \draw [red,semithick,>=triangle 60, {Latex}-{Latex}] ($ (#1) - (-0.1,#3) $) -- ($ (#2) - (-0.1,#3) $) node [below,midway] {#4}; } \newcommand{\timemeasuup}[4] { \draw [red,semithick] ($ (#1) - (-0.1,0) $) -- ($ (#1) - (-0.1,#3) -(0,-1) $); \draw [red,semithick] ($ (#2) - (-0.1,0) $) -- ($ (#2) - (-0.1,#3) -(0,-1) $); \draw [red,semithick,>=triangle 60, <->] ($ (#1) - (-0.1,#3) $) -- ($ (#2) - (-0.1,#3) $) node [above,midway] {#4}; } \newcommand*\circled[1]{ \tikz[baseline=(char.base)]{ \node[shape=circle,draw,inner 
sep=0.5pt,fill=white] (char) {\scriptsize#1}; } } \newcommand*\circledx[1]{ \tikz[baseline=(char.base)]{ \node[shape=circle,draw,inner sep=0.1pt,fill=white] (char) {\tiny\tiny#1}; } } \usepackage{circuitikz} \newcommand\todo[1]{\textcolor{red}{#1}} \hyphenation{pre-charged} \hyphenation{DRAMPower} \hyphenation{DRAMSys} \hyphenation{VAMPIRE} %\received{20 February 2007} %\received[revised]{12 March 2009} %\received[accepted]{5 June 2009} %% %% end of the preamble, start of the body of the document source. \begin{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Header %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \title{DRAMPower~5: An Open-Source Power~Simulator for Current~Generation DRAM~Standards} %% %% The "author" command and its associated commands are used to define %% the authors and their affiliations. %% Of note is the shared affiliation of the first two authors, and the %% "authornote" and "authornotemark" commands %% used to denote shared contribution to the research. 
\author{Lukas Steiner} \orcid{0000-0003-2677-6475} \affiliation{% \institution{University of Kaiserslautern-Landau} \city{Kaiserslautern} \country{Germany} } \email{lukas.steiner@rptu.de} \author{Thomas Psota} \orcid{0009-0009-3368-5396} \affiliation{% \institution{Fraunhofer IESE} \city{Kaiserslautern} \country{Germany} } \email{thomas.psota@iese.fraunhofer.de} \author{Marco Mörz} \orcid{} \affiliation{% \institution{Fraunhofer IESE} \city{Kaiserslautern} \country{Germany} } \email{marco.moerz@iese.fraunhofer.de} \author{Derek Christ} \orcid{0009-0005-4234-6362} \affiliation{% \institution{Julius-Maximilians-Universität} \city{Würzburg} \country{Germany} } \email{derek.christ@uni-wuerzburg.de} \author{Matthias Jung} \orcid{0000-0003-0036-2143} \affiliation{% \institution{Julius-Maximilians-Universität} \city{Würzburg} \country{Germany} } \email{m.jung@uni-wuerzburg.de} \author{Norbert Wehn} \orcid{0000-0002-9010-086X} \affiliation{% \institution{University of Kaiserslautern-Landau} \city{Kaiserslautern} \country{Germany} } \email{norbert.wehn@rptu.de} %% %% By default, the full list of authors will be used in the page %% headers. Often, this list is too long, and will overlap %% other information printed in the page headers. This command allows %% the author to define a more concise list %% of authors' names for this purpose. \renewcommand{\shortauthors}{Steiner et al.} %% %% The abstract is a short summary of the work to be presented in the %% article. \begin{abstract} As off-chip memory accesses nowadays dominate the overall power consumption of many compute platforms, accurate DRAM power simulation models are an important tool for system designers. Unfortunately, existing open-source models only support older generations of DRAM standards, while current system designs mainly rely on the newest generation including DDR5, LPDDR5 or HBM3. 
In addition, the existing models are not directly applicable to the new standards because of the much higher data rates and newly introduced features. This paper presents DRAMPower~5, a completely revised version of the popular DRAMPower simulator, which uses newly developed core and interface power models to support the current generation of DRAM standards. In addition, DRAMPower~5 features a redesigned software architecture that enables both fast and accurate simulation. The tool is open source and available on GitHub. \end{abstract} %% %% \begin{CCSXML} 10010583.10010662.10010674 Hardware~Power estimation and optimization 500 10010583.10010600.10010607.10010608 Hardware~Dynamic memory 500 10010147.10010341.10010342 Computing methodologies~Model development and analysis 300 \end{CCSXML} \ccsdesc[500]{Hardware~Power estimation and optimization} \ccsdesc[500]{Hardware~Dynamic memory} \ccsdesc[300]{Computing methodologies~Model development and analysis} %% Keyword settings \keywords{DRAMPower, DRAM, power, energy, simulation, interface} \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Body Text %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\input{content/01_intro} %Intro %Related %DRAM Background: Short Intro of DRAM Interface and Core, single ended bidirection DQ, differential data strobe, data sampled when DQS\_t and DQS\_c cross -> double data rate % \section{Introduction} % The recent expansion of memory-intensive applications has led to increased demand for DRAM bandwidth and capacity in current computing systems. This demand is particularly pronounced in AI applications, where specialized accelerator chips with immense DRAM bandwidths beyond 1\,TBps are used. On these platforms, memory dominates the total power consumption. %However, these bandwidths come at the cost of high power consumption. 
When training large AI models, it was found that up to \SI{90}{\percent} of the system power is consumed by memory accesses~\cite{bou_24}.
%Even in embedded augmented reality devices for the Metaverse, memory can account for more than 40\,\% of power consumption \cite{yankao_24}.
Therefore, an accurate estimation of DRAM power consumption is critical in the early stages of design to properly dimension the power supply circuits and cooling.
In consumer devices, it was likewise found that, on average, more than \SI{60}{\percent} of the total system power is spent on memory accesses~\cite{borgho_18}.
Although the overall power budget for these devices is limited to only a few watts, it is equally important to accurately estimate the DRAM power consumption, for example, to explore the power saving potential of software improvements during system design.
In the current state of the art, there are two widely used open-source simulation tools for estimating DRAM power consumption, namely \textit{DRAMPower}~\cite{kargoo_14} and \textit{CACTI-IO}~\cite{joukah_15}.
DRAMPower focuses on the DRAM core, while CACTI-IO models the DRAM interface.
Unfortunately, both tools only provide support for older standards.
At the same time, current generation standards like DDR5, LPDDR5 and HBM3 enable much higher interface speeds and offer new features, both of which require special consideration for power modeling.
In addition, the standards are inconsistent in specifying operating currents, making it difficult to create a universal power model.
To the best of our knowledge, there is no open-source DRAM power simulator that provides accurate models of both the DRAM core and interface and supports the current generation of DRAM standards, including DDR5, LPDDR5 and HBM3.
To fill this gap, we introduce DRAMPower~5, a completely revised version of the DRAMPower simulator with more accurate power models, an improved software architecture, and support for the latest DRAM standards.
In this paper, we make the following new contributions: \begin{itemize} \item We present a universal core power model that can handle the different operating current specifications of each standard. \item We introduce a newly developed interface power model with improved accuracy at high operating frequencies. \item We present the updated software architecture of the simulator and evaluate its simulation speed and accuracy. %\item We present newly developed core and interface power models that are required to accurately capture current generation DRAM standards. %\item We explain how the inconsistent and incomplete operating current specifications provided in the DRAM standards need to be treated to model core power. %\item We show that at high operating frequencies, the approximations commonly used for interface power modeling result in large errors and a different modeling approach is required. %\item We present a new simulator architecture that can be easily extended by new standards or features and achieves high simulation speeds. %\item \todo{Accuracy simulations} %\item \todo{supported standards!!!} \end{itemize} % The remainder of the paper is structured as follows. Section~\ref{sec:related} discusses related work on DRAM power modeling. Section~\ref{sec:background} provides the reader with the necessary background on DRAM. In Section~\ref{sec:core_power_modeling}, the core power modeling is explained, while Section~\ref{sec:interface_power_modeling} addresses the interface power modeling. Section~\ref{sec:simulator} provides a short overview of the new simulator. Finally, Section~\ref{sec:conclusion} concludes the paper and gives an outlook on future work. % % \section{Related Work}\label{sec:related} % In this section, we provide an overview of related work. An often used DRAM power model is the System Power Calculator by Micron~\cite{micron_ddr3_11_kopie_ipsj}. It is provided in the form of spreadsheets for various JEDEC standards. 
% (DDR/2/3/4, LPDDR2/3/4/4X). %The power estimation is based on datasheet currents and timings for a specific DRAM device and workload specifications like the read/write ratio and time that the DRAM is in each state. The power estimation is based on DRAM timing and current datasheet values and workload specifications like the read/write ratio. However, this modeling approach can only achieve a limited accuracy because the actual command trace that is issued to the DRAM by the memory controller is not considered. In addition, there exist no spreadsheets for current generation standards. A more accurate simulation tool is DRAMPower~\cite{kargoo_14}, which also relies on datasheet values, but in addition uses a real DRAM command trace as input to model the internal state transitions with cycle accuracy. %Since the internal DRAM states were initially simplified and the power dependence on the number of active DRAM banks was not taken into account, DRAMPower was later extended by a bank-sensitive model in~\cite{junmat_16b,matzul_17} to improve its accuracy. Since initially the power dependence on the number of active DRAM banks was not taken into account, DRAMPower was later extended with a bank-sensitive model~\cite{junmat_16b,matzul_17} to improve its accuracy. Still, the tool has two drawbacks: it only models core power, but no interface power, and it has not been updated to the latest standards yet. Another simulator similar to DRAMPower is VAMPIRE~\cite{ghoyag_18}. This tool puts its focus on the power variations between different DRAM modules, within one DRAM module depending on the access location, and the data value dependency. VAMPIRE is calibrated with measurements of real DRAM modules and provides very accurate results. However, this presupposes that real measurements are available for the devices to be used, which is not usually the case in the early stages of design. Additionally, VAMPIRE supports DDR3 only. 
In~\cite{vog_10}, an analytical DRAM core power model is presented. It approximates the power consumption based on the internal device architecture and technology and can also be extrapolated to future technologies. Since the model was already developed 15 years ago, it is not clear whether an extrapolation to current technologies will still provide accurate results. %When it comes to DRAM interface power modeling, the most popular software is CACTI-IO~\cite{joukah_12,joukah_15}. The most popular tool for DRAM interface power modeling is CACTI-IO~\cite{joukah_15}. CACTI-IO does not rely on datasheet currents, but it uses an equivalent circuit diagram of the interface between DRAM controller and devices. The power consumption is then calculated with a simplified network analysis. While the results are accurate for older generation standards, the simplifications introduce a large error for current generation standards as they support higher data rates. In summary, there is no publicly available DRAM power simulation tool capable of modeling both core and interface power of current generation DRAM standards with high accuracy. % % \section{DRAM Background}\label{sec:background} % This section provides the necessary background on the DRAM core and interface relevant to power modeling. It also briefly introduces the different families of DRAM standards and explains their key differences. % \subsection{Core} % DRAM is a type of memory primarily optimized for low cost per bit. To achieve high memory density, the chips are internally organized in a hierarchical fashion consisting of \textit{columns}, \textit{rows}, \textit{banks} and, for newer standards, \textit{bank groups}. When data should be read or written from or to a column, the corresponding row must be \textit{activated} first. Within each bank, only one row can be active at a time and the bank must be \textit{precharged} before a new row can be activated. 
Data is transferred over the interface in a burst fashion, i.e., for a read operation, a large amount of data is first fetched internally in parallel from the array to the interface, before it is transferred to the memory controller in multiple beats. Information is stored as an electrical charge held in a tiny capacitor. As the capacitor leaks this charge over time, each DRAM cell must be \textit{refreshed} regularly (usually every 32 to \SI{64}{\milli\second}). The refresh operation is triggered externally by the memory controller with a refresh command. During refresh, no data can be accessed within the target bank(s). Thus, only a few rows are refreshed each time to avoid long access delays and a refresh command is sent every few microseconds. To save energy, DRAM devices can be put into a \textit{power-down} mode when no data accesses are performed. This disables parts of the core and interface. %However, in order to perform refreshes for data retention, it is necessary to interrupt the power-down mode periodically. However, the power-down mode must be interrupted periodically to perform refreshes. To avoid this, the \textit{self refresh} mode can be entered where data retention is managed by the device itself and no refresh commands need to be provided by the memory controller.% any more. % \subsection{Interface}\label{subsec:background_interface} % All modern DRAM subsystems use a bidirectional single-ended \textit{data bus} (DQ) to transfer data from the memory controller to the DRAM devices or the other way round. To sample the data at the correct time, a differential \textit{data strobe} pair (DQS\_t/DQS\_c) is provided by the driving side. Since data is sampled both at the rising and the falling edge of the data strobe (intersection of DQS\_t and DQS\_c), the bus operates at \textit{double data rate} (DDR). Commands and addresses are transferred from the memory controller to the DRAM devices over a unidirectional \textit{command/address bus} (CA). 
They are sampled on the edges of a differential clock signal pair (CK\_t/CK\_c) that is also driven by the memory controller. %Depending on the standard, the command/address bus is either operated at \textit{single data rate} (SDR) or double data rate. %In addition, the transfer of a single command can take one or multiple clock cycles depending on the standard because the width of the command/address bus varies. Since all modern DRAM standards operate at frequencies in the gigahertz range with data rates reaching more than 8\,Gbps/pin, the signals are terminated at the receiver side to ensure their integrity. % To increase the memory capacity of a DRAM channel, multiple devices can be connected to the same memory controller, sharing the command/address and data bus (so-called ranks). The target device is selected by the controller via a \textit{chip select} signal (CS). The physical interconnect between memory controller and DRAM can be realized in different ways, e.g., through a classical \textit{printed circuit board} (PCB), a \textit{package on package} (PoP) arrangement or a silicon interposer. All these channels have different characteristics in terms of load capacitances, reflections and loss, so they need to be modeled individually for an accurate power estimation. One special interconnection type widely used in PCs and servers is the \textit{dual inline memory module} (DIMM). Multiple DRAM chips are soldered onto a small PCB with pins on the bottom edge, which is then plugged into a socket on the main PCB. DIMMs require extra considerations for power modeling as there are different wiring topologies, off-die termination, and in some cases additional buffer chips for the command/address bus and data bus. Due to space limitations, this is not discussed in detail in this paper. 
% \subsection{DRAM Standards} % %\todo{Special features see Luizas Master Thesis, e.g., DBI, write X, new refresh modes etc.} Over the last quarter century, JEDEC has published more than 20 different DRAM standards. As DRAM application fields become more heterogeneous, so do the standards. Currently, there are four major families: \begin{itemize} \item \textit{DDR} is used as general-purpose memory for PCs and servers as it provides high capacities at a low cost. It can be organized as single devices or DIMMs. \item \textit{Low-Power DDR (LPDDR)} is optimized for a low power consumption and mainly used in battery-powered devices like smartphones or embedded systems. \item \textit{Graphics DDR (GDDR)} offers higher bandwidths than DDR and is mainly used in GPUs. \item \textit{High Bandwidth Memory (HBM)} provides even higher bandwidths than GDDR by utilizing a much wider data bus and a silicon interposer for connection. It is mainly used in high-performance GPUs and ASICs. \end{itemize} % The DRAM core architecture is similar across all standards and has not changed much over the years. Newer standards usually come with higher memory capacities, a slightly reduced core supply voltage and some new commands to improve the performance or power efficiency. An example of this is the different refresh modes, which will be explained in more detail in Section~\ref{subsec:refresh}. However, the interface between memory controller and devices differs greatly from standard to standard. These differences include pin data rate, pin count, termination scheme, channel loss characteristics, signaling voltage and clocking architecture. In order to achieve an accurate interface power modeling, all these differences need to be considered in the calculations. More details are provided in Section~\ref{sec:interface_power_modeling}. %\todo{In this first release of the revised version of DRAMPower, we focus on the DDR and LPDDR families, more specifically on DDR3/4/5 and LPDDR4/4X/5/5X. 
%GDDR and HBM will be added in a future release.} %Figure~\ref{tab:standards} provides an overview of the most important features and characteristics of each standard. % % %Interface depends on controller and devices -> not fixed for one device % % %% %DDR3: push pull termination DQ %DDR4: pseudo open drain logic (PODL) DQ, DBI, %LPDDR4: low voltage swing termination logic (LVSTL) %HBM: active inductor, doubled current (Active-Inductive CTLE continuous-time linear equalizer) % %Input/output capacitance is specified in standard % %%\begin{table*} %% \centering %% \caption{Feature Overview of JEDEC Standards} %% \label{tab:standards} %% \begin{tabular}{c|c|c|c} %% Standard & Termination & Operating Voltages & Special Features\\ %% DDR3 & SSTL & 1.5/1.5 & \\ %% DDR4 & PODL & 2.5/1.2/1.2 & Data Bus Inversion\\ %% DDR5 & PODL & 1.8/1.1/1.1 & Write Pattern, Command \& Address Inversion\\ %% LPDDR4 & LVSTL & 1.8/1.1/1.1\\ %% LPDDR4X & LVSTL & 1.8/1.1/0.6\\ %% LPDDR5 & LVSTL & 1.8/1.05/0.5 & Write Clock\\ %% \end{tabular} %%\end{table*} % %DIMM Features: %DDR3 UDIMM: Fly-By topology except for DQ/DM and DQS, Push-Pull Termination CK\_t, CK\_c, CTRL (S0\_n, ODT0, CKE0), CMD, Series resistor Rs for DQ/DM and DQS\_t/DQS\_c, other DIMMs: LRDIMM, RDIMM, SODIMM... %DDR4 UDIMM: Fly-By topology except for DQ/DM and DQS, Push-Pull Termination ... %DDR5 UDIMM: Fly-By , other DIMMs: LRDIMM, RDIMM, SODIMM, CUDIMM (clocked unbuffered)... % % \section{Core Power Modeling}\label{sec:core_power_modeling} % This section explains the modeling of core power, while the modeling of interface power is covered in the next section. Core and interface can be considered completely independent of each other because they use different supply voltages. Core power refers to the power consumed by the internal circuitry of the DRAM device, i.e., the memory arrays, sense amplifiers, row and column decoders, I/O gating and control logic. 
The receiver circuits at the interface are also operated with the core supply voltage and are therefore included in the core power. As the internal architecture of modern DRAM devices is very complex and highly proprietary, core power calculation cannot be based on classical network analysis. Thus, each DRAM standard defines a set of currents for fixed operating scenarios, which are listed in vendor datasheets. Based on these currents, the core power can be estimated. %%%% %%%% The following section provides an overview of these currents. % \subsection{Current Measurement Conditions}\label{subsec:current_measurement} % The minimum set specified in all DRAM standards includes the following nine currents: % \begin{itemize} \item $I_{DD0}$ (Operating one bank active-precharge current): Activate and precharge commands are sent alternately with minimum spacing.% The target bank is toggled with each activate command. \item $I_{DD2N}$ (Precharge standby current): All banks are precharged and no commands are issued. \item $I_{DD2P}$ (Precharge power-down current): All banks are precharged, no commands are issued and the device is in power-down mode. \item $I_{DD3N}$ (Active standby current): All banks are active and no commands are issued. \item $I_{DD3P}$ (Active power-down current): All banks are active, no commands are issued and the device is in power-down mode. \item $I_{DD4R}$ (Operating burst read current): All banks are active and read commands are issued with minimum spacing.% The target bank is toggled with each read command. \item $I_{DD4W}$ (Operating burst write current): All banks are active and write commands are issued with minimum spacing.% The target bank is toggled with each write command. \item $I_{DD5B}$ (Burst refresh current): Refresh commands are issued with minimum spacing. \item $I_{DD6}$ (Self refresh current): The device is in self refresh mode and the external clock is turned off. 
\end{itemize}
%
Unfortunately, the different JEDEC subcommittees, which are responsible for formulating DRAM standards, are very inconsistent in specifying the currents.
%Apart from different naming schemes\footnote{To avoid confusion, we use our own naming scheme, which is a mixture of several standards.}, the measurement conditions mentioned above only apply for standards of the DDR family, while they differ for LPDDR, GDDR and HBM.
Apart from different naming schemes, the measurement conditions mentioned above apply only to standards of the DDR family, while they differ for LPDDR, GDDR and HBM.
For example, LPDDR measures $I_{DD3N}$, $I_{DD3P}$, $I_{DD4R}$ and $I_{DD4W}$ with only one bank active.
GDDR measures $I_{DD3N}$ and $I_{DD3P}$ with one bank active, while $I_{DD4R}$ and $I_{DD4W}$ are measured with one bank in each bank group active.
HBM, in turn, measures $I_{DD3N}$ and $I_{DD3P}$ with one bank active and $I_{DD4R}$ as well as $I_{DD4W}$ with all banks active.
Section~\ref{subsec:bankwise} explains how these different measurement conditions are treated to achieve a universal bank-sensitive power model.
Similarly, the refresh currents are measured under varying conditions.
While DDR standards specify a burst refresh current $I_{DD5B}$ for all available refresh modes, LPDDR standards specify a burst refresh current only for all-bank refresh, while for per-bank refresh, an average current $I_{DD5A}$ is provided.
The difference between $I_{DD5B}$ and $I_{DD5A}$ is the spacing between two consecutive refresh commands.
This spacing is the refresh cycle time $t_{RFC}$ (i.e., the duration of a single refresh operation) for $I_{DD5B}$ and the much longer average refresh interval $t_{REFI}$ (i.e., the interval at which refresh commands need to be issued in normal operation) for $I_{DD5A}$, as shown in Figure~\ref{fig:refresh_currents}.
%GDDR5/5X/6 and HBM1/2 do not specify a current for per-bank refresh at all although they support it.
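To give a feeling for the two spacings, assume purely illustrative values of $t_{RFC} = \SI{350}{\nano\second}$ and $t_{REFI} = \SI{7.8}{\micro\second}$ (assumptions for this example, not taken from a specific datasheet).
Under the $I_{DD5B}$ measurement condition, a refresh is in progress all the time, whereas under the $I_{DD5A}$ measurement condition, refresh operations occupy only a fraction of
\[
  \frac{t_{RFC}}{t_{REFI}} = \frac{\SI{350}{\nano\second}}{\SI{7800}{\nano\second}} \approx 0.045
\]
of the time, i.e., roughly \SI{4.5}{\percent} for these assumed timings.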
Section~\ref{subsec:refresh} explains how refresh power can be modeled using the provided currents of each standard.
However, even if all missing currents can be calculated, the approach used for core power calculation still faces two problems, which have also been highlighted in~\cite{ghoyag_18}.
First, there are large device-to-device variations, which force the vendors to be very pessimistic when specifying operating currents.
As a consequence, power is overestimated in most cases.
Second, the currents are measured for fixed data and address patterns, i.e., no data dependencies and structural variations within the device are considered.
If more accurate modeling is required, the calculations have to be refined with additional device measurements.
This will be demonstrated in Section~\ref{subsec:sim_accuracy}.
%\todo{last subsection? extra features, maybe future work?}
%\todo{multiple supply voltages!}
%
\subsection{Universal Bank-Sensitive Model}\label{subsec:bankwise}
%
The DRAM core power is composed of background power and command power.
A bank-sensitive model is used for the background power, i.e., the more banks are active, the higher the power consumption.
This model was already introduced in previous versions of the tool~\cite{junmat_16b,matzul_17} and provides higher accuracy compared to a model that only distinguishes between two states (either active or precharged) like the one from Micron~\cite{micron_ddr3_11_kopie_ipsj}.
As shown in Figure~\ref{fig:bank_sensitive_currents} for a DRAM of the DDR family with $B$ banks, $I_{DD2N}$ is drawn when all banks are precharged and $I_{DD3N}$ is drawn when all $B$ banks are active.
The span in between is not divided linearly depending on the number of active banks, but there is an offset when activating the first bank.
This is because additional logic must be switched on when the first bank is activated.
%
\begin{figure}
\centering
\resizebox{.9\linewidth}{!}{%
\input{img/bankwise_current}
}
\caption{Bank-Sensitive Currents~\cite{junmat_16b}}
\label{fig:bank_sensitive_currents}
\end{figure}
%
The factor $\rho$ is a vendor- and device-specific value between 0 and 1, which can be determined by measurement~\cite{junmat_16b}.
Alternatively, the pessimistic assumption of $\rho = 1$ can be made, which leads to the simplified model with only two distinct states.
For standards of the DDR family, $I_{DD3N} = I_{\circled{B}}$ holds, while for LPDDR, GDDR and HBM, $I_{DD3N} = I_{\circled{1}}$ holds.
This difference must be taken into account when calculating the background power.
If the current $I_{DD2N}$, the factor $\rho$, a current $I_{DD3N}$ measured with $M$ banks active, and the total number of banks $B$ are given, all other currents can be calculated.
For $1 \le M \le B$ active banks, the background current is
\begin{equation}
I_{\circled{M}} = I_{DD2N} + (I_{\circled{B}} - I_{DD2N}) \cdot \left(\rho + (1-\rho)\cdot \frac{M}{B}\right).
\end{equation}
When the DRAM is in power-down mode, the dependence of the current on the number of active banks is much smaller, so we only distinguish between two states characterized by $I_{DD2P}$ and $I_{DD3P}$.
The average command power is calculated by counting the number of commands of each type, adding up the energy that is consumed for all these commands, and dividing the total energy by the simulated time.
As for the background power, the differences among the standards must be taken into consideration for the command power as well.
In \cite{junmat_16b}, the energy for a read command $E_{RD}$ is calculated as
\begin{equation}
E_{RD} = V_{DD} \cdot (I_{DD4R} - I_{DD3N}) \cdot \frac{BL}{DR} \cdot t_{CK}
\end{equation}
where $V_{DD}$ is the core supply voltage, $BL$ is the burst length, $DR$ is the data rate and $t_{CK}$ is the clock period.
For a write command, $I_{DD4R}$ is replaced with $I_{DD4W}$.
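As a purely illustrative calculation, consider a hypothetical DDR-family device with $B = 16$ banks, $I_{DD2N} = \SI{40}{\milli\ampere}$, $I_{DD3N} = I_{\circled{B}} = \SI{60}{\milli\ampere}$ and $\rho = 0.5$.
All of these numbers are assumptions chosen for easy arithmetic and are not taken from a datasheet.
With $M = 4$ active banks, the background current evaluates to
\[
  I_{\circled{4}} = \SI{40}{\milli\ampere} + \SI{20}{\milli\ampere} \cdot \left(0.5 + 0.5 \cdot \frac{4}{16}\right) = \SI{52.5}{\milli\ampere}.
\]
Further assuming $V_{DD} = \SI{1.1}{\volt}$, $I_{DD4R} = \SI{160}{\milli\ampere}$, $BL = 16$, $t_{CK} = \SI{0.5}{\nano\second}$ and interpreting $DR = 2$ as two data transfers per clock cycle, the read command energy of the same hypothetical device becomes
\[
  E_{RD} = \SI{1.1}{\volt} \cdot \left(\SI{160}{\milli\ampere} - \SI{60}{\milli\ampere}\right) \cdot \frac{16}{2} \cdot \SI{0.5}{\nano\second} = \SI{0.44}{\nano\joule}.
\]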
However, this equation only works if $I_{DD4R}$ and $I_{DD3N}$ are measured with the same number of banks active, which is not the case for GDDR and HBM. Thus, the equations need to be adapted accordingly, i.e., for GDDR, $I_{DD3N}$ must be replaced with $I_{\circled{BG}}$ with $BG$ being the number of bank groups, while for HBM, $I_{DD3N}$ must be replaced with $I_{\circled{B}}$. % % \subsection{Refresh Power}\label{subsec:refresh} % Depending on the DRAM standard, various refresh modes are supported. They differ in the number of banks that are refreshed with a single command. All-bank refresh commands target all banks of the device at once. As no data can be accessed in banks where a refresh is in progress, this mode can cause a large drop in bandwidth. Thus, newer DRAM standards offer improved refresh modes where only a single bank (per-bank refresh), two banks (per-2-bank refresh) or one bank in each bank group (same-bank refresh) of the device are targeted with a single command, while the remaining banks can still be accessed in the meantime. The duration of a single refresh command is the refresh cycle time $t_{RFC}$, which is also the spacing of refresh commands when measuring the burst refresh current $I_{DD5B}$. Thus, when a burst refresh current is provided, the energy for a single refresh command $E_{REF}$ can be calculated as \begin{equation} E_{REF} = V_{DD} \cdot \left(I_{DD5B} - I_{\circled{M}}\right) \cdot t_{RFC} \end{equation} where $M$ is the number of refreshed banks. As the equation shows, banks with a refresh in progress are considered active, which is the most accurate way of modeling because internally the refresh is performed by successively activating multiple rows within each target bank. In the cases where only an average refresh current $I_{DD5A}$ is provided, an approximated value for $I_{DD5B}$ can be determined. 
Figure~\ref{fig:refresh_currents} demonstrates the relation between both refresh currents graphically, where the dashed boxes represent the energy that is consumed. The voltage is constant and can be neglected. % \begin{figure} \centering \resizebox{.85\linewidth}{!}{% \input{img/refresh_currents} } \caption{Relation between Burst Refresh Current and Average Refresh Current} \label{fig:refresh_currents} \end{figure} % \begin{figure*} \centering \begin{subfigure}[b]{0.32\linewidth} \centering \resizebox{\linewidth}{!}{% \begin{circuitikz} \ctikzset{bipoles/resistor/height=0.15} \ctikzset{bipoles/resistor/width=0.4} %\ctikzset{bipoles/tline/width=0.6} \draw (0,0) node[pmos, emptycircle, anchor=D](P){}; \draw (0,0) node[nmos, anchor=D](N){}; \draw (P.S) -- ++(0,0) node[tground](VDDQ){}; \node[anchor=south] at (VDDQ) {$V_{DDQ}$}; \draw (N.S) -- ++(0,0) node[tlground](VSSQ){}; \draw (P.G) -- (N.G); \draw (P.south west) to[short, *-o] ++(-0.5,0); \draw (P.D) to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) coordinate(TL); \draw (TL) to[R, a=$R_{TT}$] (TL|-VDDQ) node[tground](VDDQ1){}; \node[anchor=south] at (VDDQ1) {$V_{DDQ}$}; \draw (TL) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){}; \draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0); \draw (recv.bout) to[short, -o] ++(0.5,0); \end{circuitikz}% } \caption{Pseudo Open Drain Logic (PODL)} \label{fig:term_podl} \end{subfigure} \hfill \begin{subfigure}[b]{0.32\linewidth} \centering \resizebox{\linewidth}{!}{% \begin{circuitikz} \ctikzset{bipoles/resistor/height=0.15} \ctikzset{bipoles/resistor/width=0.4} %\ctikzset{bipoles/tline/width=0.6} \draw (0,0) node[pmos, emptycircle, anchor=D](P){}; \draw (0,0) node[nmos, anchor=D](N){}; \draw (P.S) -- ++(0,0) node[tground](VDDQ){}; \node[anchor=south] at (VDDQ) {$V_{DDQ}$}; \draw (N.S) -- ++(0,0) node[tlground](VSSQ){}; \draw (P.G) -- (N.G); \draw (P.south west) to[short, *-o] ++(-0.5,0); \draw (P.D) to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) 
coordinate(TL); \draw (TL) to[R=$R_{TT}$] (TL|-VSSQ) node[tlground]{}; \draw (TL) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){}; \draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0); \draw (recv.bout) to[short, -o] ++(0.5,0); \end{circuitikz}% } \caption{Low Voltage Swing Term. Logic (LVSTL)} \label{fig:term_lvstl} \end{subfigure} \hfill \begin{subfigure}[b]{0.32\linewidth} \centering \resizebox{\linewidth}{!}{% \begin{circuitikz} \ctikzset{bipoles/resistor/height=0.15} \ctikzset{bipoles/resistor/width=0.4} %\ctikzset{bipoles/tline/width=0.6} \draw (0,0) node[pmos, emptycircle, anchor=D](P){}; \draw (0,0) node[nmos, anchor=D](N){}; \draw (P.S) -- ++(0,0) node[tground](VDDQ){}; \node[anchor=south] at (VDDQ) {$V_{DDQ}$}; \draw (N.S) -- ++(0,0) node[tlground](VSSQ){}; \draw (P.G) -- (N.G); \draw (P.south west) to[short, *-o] ++(-0.5,0); \draw (P.D) to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) coordinate(TL); \draw (TL) to[R, a=$2R_{TT}$] (TL|-VDDQ) node[tground](VDDQ1){}; \node[anchor=south] at (VDDQ1) {$V_{DDQ}$}; \draw (TL) to[R=$2R_{TT}$] (TL|-VSSQ) node[tlground]{}; \draw (TL) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){}; \draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0); \draw (recv.bout) to[short, -o] ++(0.5,0); \end{circuitikz}% } \caption{Stub Series Terminated Logic (SSTL)} \label{fig:term_sstl} \end{subfigure} % \caption{DRAM Interface Termination Schemes} \label{fig:term} \end{figure*} % From the definitions of the two currents, we know that within a refresh interval $t_{REFI}$, the burst refresh energy and the average refresh energy are identical. 
This relationship can be translated into the following equation to calculate $I_{DD5B}$ from $I_{DD5A}$:
\begin{equation}
  I_{DD5B} = I_{DD2N} + \left(I_{DD5A} - I_{DD2N}\right) \cdot \frac{t_{REFI}}{t_{RFC}}.
\end{equation}
%
\section{Interface Power Modeling}\label{sec:interface_power_modeling}
%
Interface power refers to the power consumed by the drivers for the communication between memory controller and DRAM devices.
In contrast to the core power, which is fixed for a specific device, the interface power depends on the complete DRAM subsystem architecture, i.e., the \textit{physical layer} (PHY) of the memory controller, the channel architecture (number of ranks, possible usage of DIMMs, etc.), the channel characteristics (e.g., channel loss and parasitic capacitances) and the DRAM PHYs.
Thus, modeling the interface power based on the operating currents specified in vendor datasheets is not possible because these currents are only measured for one specific subsystem architecture.
Instead, we calculate the interface power based on an equivalent circuit diagram of the real interface architecture, as is also done by CACTI-IO.
Interface power can be divided into \textit{termination power}, which is dissipated across the termination resistances required for signal integrity, and \textit{dynamic power}, which is dissipated through the lossy charging and discharging of parasitic capacitances and the signaling over a lossy transmission line.
In the following two sections, the calculation of termination power and dynamic power is explained.
%
\subsection{Termination Power}
%
The termination power depends on the termination scheme and the number of logic zeros and ones transmitted, but it is independent of the operating frequency.
There are three commonly used termination schemes for DRAM, shown in Figure~\ref{fig:term} for a simple point-to-point connection.
\textit{Pseudo open drain logic} (PODL) and \textit{low voltage swing terminated logic} (LVSTL) only use a pull-up or a pull-down resistor for termination, respectively.
\textit{Stub series terminated logic} (SSTL) uses both a pull-up and a pull-down resistor.
In all three cases, the termination resistance is matched to the characteristic impedance of the transmission line, i.e., $R_{TT} \approx Z_0$ (recall that in AC analysis, a DC voltage source is treated as a short).
To calculate the power, both logic levels are considered separately.
The transistor of the driver that is switched on is replaced with an equivalent resistor with resistance $R_{ON}$, while the transistor that is switched off is replaced with an open circuit.
As an example, Figure~\ref{fig:terminations} shows the two equivalent circuit diagrams for a PODL interface.
%
\begin{figure}
  \centering
  \begin{subfigure}[b]{0.37\linewidth}
    \centering
    \resizebox{\linewidth}{!}{%
    \begin{circuitikz}
      \ctikzset{bipoles/resistor/height=0.15}
      \ctikzset{bipoles/resistor/width=0.4}
      \draw (0,0) node[tground](VDDQ1){}
        to [R=$R_{ON}$] ++(0,-1.5)
        to [short=$"1"$] ++(3,0) coordinate(foo)
        to [R,l=$R_{TT}$] ++(0,1.5) node[tground](VDDQ2){};
      \node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
      \node[anchor=south] at (VDDQ2) {$V_{DDQ}$};
      \draw[white](foo) to [R] ++(0,-1.5) node[tlground](VDDQ){};
    \end{circuitikz}}
    \caption{Driving Logic "1"}
    \label{fig:term_logic_1}
  \end{subfigure}
  %
  \hspace{15pt}
  %
  \begin{subfigure}[b]{0.34\linewidth}
    \centering
    \resizebox{\linewidth}{!}{%
    \begin{circuitikz}
      \ctikzset{bipoles/resistor/height=0.15}
      \ctikzset{bipoles/resistor/width=0.4}
      \draw (0,0) node[tlground]{}
        to[R,a=$R_{ON}$] ++(0,1.5)
        to[short=$"0"$] ++(3,0)
        to[R,l=$R_{TT}$] ++(0,1.5) node[tground](VDDQ){};
      \node[anchor=south] at (VDDQ) {$V_{DDQ}$};
    \end{circuitikz}}
    \caption{Driving Logic "0"}
    \label{fig:term_logic_0}
  \end{subfigure}
  \caption{Equivalent Circuit Diagrams for PODL Termination Power}
  \label{fig:terminations}
\end{figure}
%
When driving a logic one, both ends of the circuit are connected to $V_{DDQ}$, which means that no current is flowing and no power is dissipated, i.e.,
\begin{equation}
  P_{term,1}^{PODL} = 0.
\end{equation}
In contrast, when driving a logic zero, one side is connected to ground, while the other side is connected to $V_{DDQ}$.
The dissipated power is calculated as
\begin{equation}
  P_{term,0}^{PODL} = \frac{V_{DDQ}^2}{R_{ON} + R_{TT}}.
\end{equation}
In the case of an LVSTL interface, the equations for both logic levels are swapped, i.e., power is only dissipated when driving a logic one.
Since the SSTL interface uses both a pull-up and a pull-down resistor, power is dissipated at both logic levels.
It is calculated as
\begin{equation}
  P_{term,0}^{SSTL} = P_{term,1}^{SSTL} = \frac{V_{DDQ}^2}{(R_{ON}||2R_{TT})+2R_{TT}}.
\end{equation}
%
The average termination power when transmitting $n_0$ logic zeros and $n_1$ logic ones is
\begin{equation}\label{eq:term_total}
  P_{term} = \frac{P_{term,0} \cdot n_0 + P_{term,1} \cdot n_1}{n_0 + n_1}
\end{equation}
Because with PODL and LVSTL only one logic level consumes power, data bus inversion can be used to reduce the termination power consumption.
With SSTL, the termination power is independent of the transmitted data.
For channel configurations with multiple ranks or DIMMs, the interconnect network can change from a simple point-to-point topology to a more complex topology, e.g., because the non-target dies also terminate the bus.
In these cases, termination power can be calculated in the same way by determining the equivalent circuit diagrams for both logic levels.
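%
As a brief illustration of Equation~\ref{eq:term_total}, consider a PODL byte lane under a simplifying assumption that is made here only for illustration and is not taken from a specific standard: the lane consists of eight data lines plus one inversion flag line, and the flag line carries a logic zero exactly when the byte is transmitted inverted.
For an all-zero data byte, the counts are then $n_0 = 8$, $n_1 = 1$ without inversion and $n_0 = 1$, $n_1 = 8$ with inversion.
With $P_{term,1}^{PODL} = 0$, the average termination power per line becomes
\begin{equation}
  P_{term}^{\text{plain}} = \frac{8}{9} \cdot P_{term,0}^{PODL}
  \quad\text{versus}\quad
  P_{term}^{\text{inverted}} = \frac{1}{9} \cdot P_{term,0}^{PODL}.
\end{equation}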
%
\subsection{Dynamic Power}
%
As shown in the previous section, termination power is frequency independent because it is dissipated across a purely resistive network.
Termination power represents a lower bound for the total power consumption and also dominates at low operating frequencies.
However, since current generation DRAM standards support data rates of 8\,Gbps/pin and more, the impact of parasitic capacitances is much more significant.
Figure~\ref{fig:load_caps} shows the simple point-to-point connection with PODL termination scheme as already presented in Figure~\ref{fig:term_podl}, but with two added parasitic capacitances, one at the driver side and one at the receiver side.
%
\begin{figure}
  \centering
  \resizebox{.9\linewidth}{!}{%
  \begin{circuitikz}
    \ctikzset{bipoles/resistor/height=0.15}
    \ctikzset{bipoles/resistor/width=0.4}
    %\ctikzset{bipoles/tline/width=0.6}
    \draw (0,0) node[pmos, emptycircle, anchor=D](P){};
    \draw (0,0) node[nmos, anchor=D](N){};
    \draw (P.S) -- ++(0,0) node[tground](VDDQ){};
    \node[anchor=south] at (VDDQ) {$V_{DDQ}$};
    \draw (N.S) -- ++(0,0) node[tlground](VSSQ){};
    \draw (P.G) -- (N.G);
    \draw (P.south west) to[short, *-o] ++(-0.5,0);
    \draw (P.D) to[short,*-*] ++(0.8,0) coordinate(D1)
      to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) coordinate(D3)
      to[short,-*] ++(0.8,0) coordinate(D4);
    \draw (D4) to[R,a=$R_{TT}$] (D4|-VDDQ) node[tground](VDDQ1){};
    \node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
    \draw (D4) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){};
    \draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0);
    \draw (recv.bout) to[short, -o] ++(0.5,0);
    \draw (D1) to[C=$C_{TX}$] (D1|-VSSQ) node[tlground]{};
    \draw (D3) to[C,a=$C_{RX}$] (D3|-VSSQ) node[tlground]{};
  \end{circuitikz}}
  \caption{Point-to-Point Connection with Parasitic Caps}
  \label{fig:load_caps}
\end{figure}
%
We use SPICE to analyze the power dissipation of this circuit for input signals of different operating frequencies.
The components are dimensioned as $R_{ON}$ = \SI{48}{\ohm}, $R_{TT}$ = \SI{60}{\ohm}, $C_{TX}$ = $C_{RX}$ = \SI{1}{\pico\farad} and $V_{DDQ}$ = \SI{1.1}{\volt}, which is on the order of the values of a real DDR5 interface.
For now, the transmission line losses are also modeled as a parasitic capacitance of $C_{TL}$ = \SI{2}{\pico\farad}.
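%
For reference, the pure termination power of this circuit can be worked out from the PODL equations of the previous section.
The following sketch assumes a clock-like input with equally many zeros and ones, i.e., $n_0 = n_1$:
\begin{equation}
  \begin{aligned}
    P_{term,0}^{PODL} &= \frac{(\SI{1.1}{\volt})^2}{\SI{48}{\ohm} + \SI{60}{\ohm}} \approx \SI{11.2}{\milli\watt}, \\
    P_{term} &= \frac{P_{term,0}^{PODL} + P_{term,1}^{PODL}}{2} \approx \SI{5.6}{\milli\watt}.
  \end{aligned}
\end{equation}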
%
At a frequency of \SI{100}{\mega\hertz}, the dissipated power is \SI{5.7}{\milli\watt}, which is close to the termination power of the circuit of \SI{5.6}{\milli\watt}.
However, with increasing frequency, the power also increases because the capacitors start to conduct.
At \SI{3200}{\mega\hertz} (i.e., DDR5-6400), the dissipated power is already \SI{8.6}{\milli\watt}, i.e., over \SI{50}{\percent} higher than the pure termination power.
To calculate the power dissipation analytically, the clock signal with frequency $f$ and voltage swing $V_{DDQ}$ can be expressed as a Fourier series
\begin{equation}
  v(t) = \frac{V_{DDQ}}{2} + \Re \left\{\frac{-2j \cdot V_{DDQ}}{\pi} \sum_{k=1,3,5,\dots}^{\infty} \frac{1}{k} \exp(j 2 \pi f k t)\right\}
\end{equation}
with DC component $\frac{V_{DDQ}}{2}$.
%
The complex amplitudes $\underline{\hat{V}}_k$ of the frequency components can be directly determined from this equation as
\begin{equation}
  \underline{\hat{V}}_k = \frac{-2j \cdot V_{DDQ}}{\pi} \cdot \frac{1}{k}.
\end{equation} With the frequency-dependent complex impedances $\underline{Z}_k$ calculated as \begin{equation} \underline{Z}_k = R_{ON} + \frac{1}{j 2 \pi f k (C_{TX} + C_{RX} + C_{TL}) + \frac{1}{R_{TT}}}, \end{equation} the DC resistance $R_{DC}$ calculated as \begin{equation} R_{DC} = R_{ON} + R_{TT}, \end{equation} and the voltage across $R_{DC}$ calculated as \begin{equation} V_{DC} = V_{DDQ} - \frac{V_{DDQ}}{2} = \frac{V_{DDQ}}{2}, \end{equation} the total power dissipation $P_{total}$ can be calculated as \begin{equation}\label{eq:fourier} P_{total} = \frac{V_{DC}^2}{R_{DC}} + \sum_{k=1,3,5,\dots}^{\infty} \frac{|\underline{\hat{V}}_k|^2}{2} \cdot \Re \left\{\frac{1}{\underline{Z}_k}\right\}. \end{equation} % In reality, the series needs to be terminated at a certain $k$, which can be chosen to match the finite slew rate of the signal. For LVSTL, the same equations can be applied, while for SSTL, the calculation of the DC component needs to be adapted. The dynamic power $P_{dyn}$, which adds to the termination power due to the toggling between both logic levels, is finally calculated as \begin{equation} P_{dyn} = P_{total} - P_{term}. \end{equation} One alternative formula, which is often used to approximate the dynamic power $P_{dyn}$, is given by \begin{equation}\label{eq:approx} P_{dyn} = \left(\sum_i C_i \cdot V_{sw,i}\right) \frac{V_{DDQ} \cdot f}{2} \end{equation} where $C_i$ are the capacitances along the channel and $V_{sw,i}$ are the respective voltage swings at each capacitance~\cite{bak_90,joukah_15}. The voltage swings are usually determined using a DC analysis for both logic levels. This is also done by CACTI-IO. While this approximation provides accurate results at low operating frequencies, current generation DRAM interfaces do not reach full voltage swing anymore due to the large parasitic capacitances in combination with high operating frequencies. 
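%
To make this concrete, Equation~\ref{eq:approx} can be evaluated for the example circuit of Figure~\ref{fig:load_caps}.
This is only a sketch under the assumption that all three capacitances see the same DC swing, i.e., that they are lumped at the node behind $R_{ON}$ as in the expression for $\underline{Z}_k$.
The DC analysis yields $V_{DDQ}$ for a logic one and $V_{DDQ} \cdot R_{ON} / (R_{ON} + R_{TT})$ for a logic zero, so that
\begin{equation}
  \begin{aligned}
    V_{sw} &= V_{DDQ} \cdot \frac{R_{TT}}{R_{ON} + R_{TT}} = \SI{1.1}{\volt} \cdot \frac{60}{108} \approx \SI{0.61}{\volt}, \\
    P_{dyn} &\approx \SI{4}{\pico\farad} \cdot \SI{0.61}{\volt} \cdot \frac{\SI{1.1}{\volt} \cdot \SI{3200}{\mega\hertz}}{2} \approx \SI{4.3}{\milli\watt}.
  \end{aligned}
\end{equation}
Added to the termination power of \SI{5.6}{\milli\watt} stated above, the approximation thus predicts roughly \SI{9.9}{\milli\watt} at \SI{3200}{\mega\hertz}.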
Figure~\ref{fig:power_comp} shows the total power dissipation of the previous circuit (Figure~\ref{fig:load_caps}) at different operating frequencies calculated with SPICE, Equation~\ref{eq:fourier} and Equation~\ref{eq:term_total} plus Equation~\ref{eq:approx}. % \begin{figure} \centering \resizebox{.9\linewidth}{!}{% \begin{tikzpicture} \begin{axis}[ axis equal image, xlabel={Operating Frequency [MHz]}, ylabel={Power Dissipation [mW]}, %xmode=log, xmin=0, xmax=16, xtick={2,4,6,8,10,12,14}, xticklabels={100,200,400,800,1600,3200,4200}, ymin=0, ymax=12, ybar, bar width=2mm, legend pos=north west ] \addplot+ coordinates {(2,5.7) (4,5.9) (6,6.2) (8,6.8) (10,7.75) (12,8.6) (14,8.8)}; \addplot+ coordinates {(2,5.7) (4,5.9) (6,6.2) (8,6.8) (10,7.75) (12,8.6) (14,8.8)}; \addplot+ coordinates {(2,5.7) (4,5.9) (6,6.1) (8,6.7) (10,7.75) (12,9.9) (14,11.2)}; \legend{SPICE, Fourier Series (This Work), Approximation (CACTI-IO)} \end{axis} \end{tikzpicture}% } \caption{Comparison of Different Calculation Methods for Power Dissipation} \label{fig:power_comp} \end{figure} % While the Fourier series based formula consistently provides the same results as SPICE, the approximate formula is accurate at low frequencies, but overestimates the power dissipation at higher frequencies, e.g., by \SI{15}{\percent} at \SI{3200}{\mega\hertz} (DDR5-6400) and even \SI{27}{\percent} at \SI{4200}{\mega\hertz} (DDR5-8400, the currently highest specified data rate of the standard). The loss characteristic of the transmission line can be handled in different ways. In \cite{holsta_19}, the authors have analyzed various physical DRAM interfaces, i.e., multi DIMM, package on package, PCB trace and silicon interposer. They show that the channels have very distinct insertion loss characteristics, which need to be taken into consideration for an accurate power estimation. 
A linear loss characteristic can be approximated with an additional capacitance, while more complex loss characteristics can be modeled with frequency-dependent impedance values in the Fourier series based calculation. Up until now, the formulas for dynamic power consumption assume a switching activity of $\alpha = 1$, i.e., the signals transition from logic zero to logic one once every period. While this is true for clock and data strobe signals, the command/address bus and data bus usually experience lower switching activities. Especially when a signal is only operated at SDR, the switching activity is limited to $\alpha_{max} = 0.5$. The problem is that the switching activity $\alpha$ and number of transmitted zeros $n_0$ and ones $n_1$ alone do not determine the complete signal behavior, which is demonstrated in Figure~\ref{fig:switching_signals}. % \begin{figure} \centering \resizebox{.85\linewidth}{!}{% \input{img/switching_signals} } \caption{Two Different Signals with Identical $\alpha$, $n_0$ and $n_1$} \label{fig:switching_signals} \end{figure} % Both S1 and S2 have a switching activity of $\alpha = 0.5$ and the number of transmitted zeros and ones is $n_0 = n_1 = 8$. However, S1 operates at half the clock frequency for the whole time, while S2 operates at the full clock frequency in the beginning and only one fifth of the clock frequency in the end. When the dissipated power is calculated section by section using Equation~\ref{eq:fourier}, the results for S1 and S2 differ because different voltage swings are reached in each section. In the corner cases, a signal with switching activity $\alpha$ can be either modeled with a constant switching activity for the whole time or with a maximum switching activity $\alpha_{max}$ for one part of the time and a switching activity of 0 for the other part of the time. 
The actual dynamic power consumption lies between these two corner cases and can be approximated by the mean value \begin{equation} \overline{P}_{dyn}(\alpha) = \frac{P_{dyn}(f=\alpha \cdot f_{max}) + \frac{\alpha}{\alpha_{max}} \cdot P_{dyn}(f = \alpha_{max} \cdot f_{max})}{2}. \end{equation} % Finally, the switching activity $\alpha$ can be determined by counting the number of zero to one transitions $n_{0 \rightarrow 1}$ in a given time interval $\tau$ as \begin{equation} \alpha = \frac{n_{0 \rightarrow 1}}{\tau \cdot f_{max}}. \end{equation} % % \section{Simulator Overview}\label{sec:simulator} % This section provides a short introduction to the internal software architecture of DRAMPower~5. Afterwards, the simulation speed and simulation accuracy are evaluated. % \subsection{Software Architecture} % The new version of DRAMPower is not designed as a standalone simulator, but as a library that is coupled to a DRAM subsystem simulator which models the memory controller and translates incoming read and write requests into DRAM commands. Alternatively, a DRAM command trace can be provided as an input file. For the interface power calculation, the provided commands, addresses and data are translated into equivalent bit patterns using the command truth table of the simulated standard. Based on this data, the number of transmitted zeros $n_0$, transmitted ones $n_1$ and zero to one transitions $n_{0 \rightarrow 1}$ can be calculated. To achieve high simulation speeds, bit manipulation instructions including the population count (\texttt{POPCNT}) instruction are used. Instead of real data, it is also possible to provide a switching activity $\alpha$ and a duty cycle $D$ (ratio between logic one and logic zero). In addition to the command/address and data bus, the remaining signals like the clock signal pair, data strobe pairs or chip select need to be considered (see Section~\ref{subsec:background_interface}). 
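%
As a small worked example of this counting step, consider two made-up bus words on an eight-bit bus, $p = 10100110_2$ in the previous cycle and $q = 01101100_2$ in the current cycle.
The sketch assumes that transitions are counted between consecutive words on the same lines and writes $\operatorname{popcnt}$ for the number of set bits:
\begin{equation}
  \begin{aligned}
    n_1 &= \operatorname{popcnt}(q) = 4, \\
    n_0 &= 8 - \operatorname{popcnt}(q) = 4, \\
    n_{0 \rightarrow 1} &= \operatorname{popcnt}(\lnot p \land q) = \operatorname{popcnt}(01001000_2) = 2.
  \end{aligned}
\end{equation}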
%As explained in Section~\ref{sec:interface_power_modeling}, the interface power calculation can depend on lots of parameters and, thus, can become very complex. %\todo{In order to avoid the complexity within DRAMPower, the tool only receives the precalculated termination and dynamic power values for all signals as inputs.} %These calculations need to be carried out externally using the provided equations. The core power calculation is more complex because in addition to counting the number of issued commands of each type, DRAMPower needs to keep track of the clock cycles that the DRAM is in a specific state (i.e., 0 - B banks active, active/precharge power-down, self refresh). If a configuration with multiple ranks is simulated, the counting has to be done separately for each rank. A further difficulty arises from the fact that the internal state is not always changed immediately by an external command, but it can also change after a certain delay. An example for this behavior is shown in Figure~\ref{fig:implicit_commands}. % \begin{figure} \centering \resizebox{\linewidth}{!}{% \input{img/implicit_commands} } \caption{Example for Implicit Command} \label{fig:implicit_commands} \end{figure} % When a \textit{read with auto precharge} command (\texttt{RDA}) is issued, the target bank is automatically precharged after the read to precharge delay $t_{RTP}$ has expired. This means that the DRAM will internally issue what we call an \textit{implicit command} in the future. Unfortunately, DRAMPower is not based on an event-driven simulation kernel like SystemC where an event can be directly notified in the future. Instead, it is only triggered from the outside when new commands are issued, so the implicit commands need to be handled differently. The actions that are performed by an implicit command are formulated as a lambda expression, which is stored in an internal list ordered by the time stamp of execution. 
Whenever DRAMPower is triggered from the outside, this list is first searched from the beginning for implicit commands with time stamps less than or equal to the current simulation time.
The lambda expressions of these list entries are then evaluated before the external command is handled.
The total power consumption can be queried at any time, even while the simulation is still running, which makes it possible to analyze the change of power consumption over time.
%
%
\subsection{Simulation Speed}
%
Since DRAMPower is not operated as a standalone tool in the normal use case, but rather coupled to a behavioral DRAM subsystem simulator, we evaluate its simulation speed in terms of the overhead of adding power simulation.
For this analysis, DRAMPower is coupled with DRAMSys~\cite{stejun_20}, a well-known DRAM subsystem simulator, and executed on a server with two Intel Xeon Silver 4210R processors.
Within DRAMSys, one million read and write requests with random addresses and data are generated.
This simulation is carried out both with and without power simulation enabled.
Moreover, the simulations are also performed without actual data.
In this case, DRAMPower is provided with a switching activity $\alpha$ and a duty cycle $D$.
For the simulations with data, DRAMSys alone requires on average \SI{9.10}{\second} to finish, while with added power simulation, the average simulation time increases to \SI{11.37}{\second}.
This corresponds to an overhead of \SI{25}{\percent}.
When no data is simulated, DRAMSys alone requires on average only \SI{6.95}{\second} to finish, while with DRAMPower enabled, the simulation time increases to \SI{8.44}{\second}.
In this case, the overhead is \SI{21}{\percent}.
While this overhead may seem relatively large at first glance, there are two things to consider.
Firstly, DRAMSys is highly optimized for simulation speed as was already shown in~\cite{stejun_20}.
Secondly, if a full system simulation is performed where DRAMSys is additionally coupled to a much slower processor simulator such as gem5, the overhead of adding DRAMPower becomes negligible.
%
\subsection{Simulation Accuracy}\label{subsec:sim_accuracy}
%
To verify the power estimates of the new DRAMPower implementation, we use core and interface power measurements\footnote{The measurements do not include the interface power of the memory controller PHY.} of LPDDR4 devices from three different vendors, as reported in a study of a memory measurement platform~\cite{feldmann_23}.
Unfortunately, no measurement results for a newer standard are publicly available.
Each DRAM is operated with six different access patterns, which are analogous to the following $I_{DD}$ currents:
%
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 1}}~$I_{DD0*}$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 2}}~$I_{DD4R}$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 3}}~$I_{DD4W}$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 4}}~$I_{DD5B}$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 5}}~$I_{DD2N}$ and
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 6}}~$I_{DD6}$.
%
As it was not possible to reproduce the usual activate-precharge pattern of $I_{DD0}$ on the measurement platform, $I_{DD0*}$ is a variation using an activate-read-precharge pattern, which is replicated in the DRAMPower simulation.
In addition, the platform could not accurately measure the operating burst write current $I_{DD4W}$ because only one write request could be issued at a time.
Thus, the simulation was also configured to limit the number of outstanding write requests to one.
The initial simulations are based on the current values specified in the vendor datasheets.
In a second set of simulations, these datasheet values are replaced by the currents obtained from the actual measurements.
The results are shown in Figure~\ref{fig:power_plot}.
%
\begin{figure}
  \centering
  \resizebox{.89\linewidth}{!}{%
    \input{img/power_plot}
  }
  \caption{Average Power Consumption of Simulations and Measurements for Different Vendors}
  \label{fig:power_plot}
\end{figure}
%
As can be seen, the currents specified in the datasheets are overly pessimistic for all three vendors: The simulations based on the datasheets show on average a \SI{2.9}{\times} higher power consumption than the actual measurements.
However, when the measured currents are applied to the simulation, the deviation drops to only around \SI{18.8}{\percent}.
The largest share of this deviation is caused by the $I_{DD0*}$ pattern.
For this pattern, it is unclear whether the measurement platform was actually able to fully saturate the memory controller's buffer; if it was not, this would explain why it reports a lower average power consumption than the simulations.
Without $I_{DD0*}$, the deviation is only \SI{2.8}{\percent}.
This again highlights that truly accurate core power simulations are only possible with measured currents, while datasheet values provide a worst-case estimate.
%\todo{This deviation is caused by the interface power modeling ($I_{DD4R}$ and $I_{DD4W}$) because the interface parasitics are estimated values.}
%
%
\section{Conclusion and Future Work}\label{sec:conclusion}
%
In this paper, we have presented DRAMPower~5, a power simulator for current-generation DRAM standards.
It uses newly developed core and interface power models to flexibly support different standards and accurately capture the effects of high operating frequencies.
%DRAMPower~5 is open source and available on GitHub.
In the future, we will continue to update the simulator to support emerging standards and new features.
%This will also include pulse-amplitude modulation (PAM) signaling, which is already used in GDDR6X and GDDR7, and poses new challenges for the interface power modeling.
%
\section*{Acknowledgements}
This work was funded in part by the German Federal Ministry of Education and Research (BMBF) under grants 16ME0935, 16ME0936 and 16ME0934K (\mbox{DI-DERAMSys}) as well as grants 16ME0717 and 16ME0716K (MANNHEIM-MEMTONOMY).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Footer
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% The next two lines define the bibliography style to be used, and
%% the bibliography file.
\bibliographystyle{ACM-Reference-Format}
\bibliography{drampower}
%\input{drampower-appendix}
\end{document}
\endinput