%% Commands for TeXCount
%TC:macro \cite [option:text,text]
%TC:macro \citep [option:text,text]
%TC:macro \citet [option:text,text]
%TC:envir table 0 1
%TC:envir table* 0 1
%TC:envir tabular [ignore] word
%TC:envir displaymath 0 word
%TC:envir math 0 word
%TC:envir comment 0 0
%%
%%
%% The first command in your LaTeX source must be the \documentclass
%% command.
%%
%% For submission and review of your manuscript please change the
%% command to \documentclass[manuscript, screen, review]{acmart}.
%%
%% When submitting camera ready or to TAPS, please change the command
%% to \documentclass[sigconf]{acmart} or whichever template is required
%% for your publication.
%%
%%
\documentclass[sigconf, anonymous, review, nonacm=true]{acmart}
%\documentclass[sigconf]{acmart}
%%
%% \BibTeX command to typeset BibTeX logo in the docs
\AtBeginDocument{%
\providecommand\BibTeX{{%
Bib\TeX}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Document Settings
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\input{drampower-settings}
%%
%% Set the title of the paper here
\newcommand{\papertitle}{
DRAMPower~5: An Open-Source Power Simulator for Current Generation DRAM Standards
}
%%
%% Conference settings
%\copyrightyear{2025}
%\acmYear{2025}
%\setcopyright{acmlicensed}\acmConference[RAPIDO 2025]{Proceedings of the 2023 Workshop on System Engineering for constrained embedded systems}{January 17--18, 2023}{Toulouse, France}
%\acmBooktitle{Proceedings of the 2023 Workshop on System Engineering for constrained embedded systems (RAPIDO 2023), January 17--18, 2023, Toulouse, France}
%\acmPrice{15.00}
%\acmDOI{10.1145/3579170.3579259}
%\acmISBN{979-8-4007-0045-3/23/01}
%%
%% Paper logo settings
\newcommand{\paperlogo}{
\begin{teaserfigure}
\includegraphics[width=\textwidth]{figs/logo_drampower_5_0.png}
\caption{DRAMPower 5.0 Logo}
\Description{DRAMPower 5.0 Logo.}
\label{fig:logo}
\end{teaserfigure}
}
%%
%% Keyword settings
\newcommand{\paperkeywords}{\keywords{
DRAMPower, DRAM, power, simulation, interface
}}
\usepackage{subcaption}
\usepackage{siunitx}
\usepackage{tikz}
\usetikzlibrary{patterns,arrows,decorations.pathreplacing}
\usetikzlibrary{arrows.meta}
%\usetikzlibrary{arrows,automata}
\usetikzlibrary{positioning}
\usetikzlibrary{positioning,shadows,trees}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\usepgfplotslibrary{groupplots}
\usepackage{tikz-timing}[2009/12/09]
\usetikztiminglibrary{overlays}
%%% Timing Diagram Setup %%%
%Define different DRAM commands:
\tikztimingmetachar{A}{1.0D{\texttt{ACT}}}
\tikztimingmetachar{P}{1.0D{\texttt{PRE}}}
\tikztimingmetachar{X}{1.0D{\texttt{DES}}}
\tikztimingmetachar{R}{1.0D{\texttt{RDA}}}
\tikztimingmetachar{W}{1.0D{\texttt{WR}}}
\tikztimingmetachar{O}{1.0D{\texttt{NOP}}}
\newcommand{\timemeasure}[4]
{
\draw [red,semithick] ($ (#1) - (-0.1,0) $) -- ($ (#1) - (-0.1,#3) -(0,1) $);
\draw [red,semithick] ($ (#2) - (-0.1,0) $) -- ($ (#2) - (-0.1,#3) -(0,1) $);
\draw [red,semithick,>=triangle 60, {Latex}-{Latex}] ($ (#1) - (-0.1,#3) $) -- ($ (#2) -
(-0.1,#3) $) node [below,midway] {#4};
}
\newcommand{\timemeasuup}[4]
{
\draw [red,semithick] ($ (#1) - (-0.1,0) $) -- ($ (#1) - (-0.1,#3) -(0,-1) $);
\draw [red,semithick] ($ (#2) - (-0.1,0) $) -- ($ (#2) - (-0.1,#3) -(0,-1) $);
\draw [red,semithick,>=triangle 60, <->] ($ (#1) - (-0.1,#3) $) -- ($ (#2)
- (-0.1,#3) $) node [above,midway] {#4};
}
\newcommand*\circled[1]{
\tikz[baseline=(char.base)]{
\node[shape=circle,draw,inner sep=0.5pt,fill=white] (char) {\scriptsize#1};
}
}
\newcommand*\circledx[1]{
\tikz[baseline=(char.base)]{
\node[shape=circle,draw,inner sep=0.1pt,fill=white] (char) {\tiny\tiny#1};
}
}
\usepackage{circuitikz}
\newcommand\todo[1]{\textcolor{red}{#1}}
\hyphenation{pre-charged}
\hyphenation{DRAMPower}
%\received{20 February 2007}
%\received[revised]{12 March 2009}
%\received[accepted]{5 June 2009}
%%
%% end of the preamble, start of the body of the document source.
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Header
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\papertitle}
%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
%% Of note is the shared affiliation of the first two authors, and the
%% "authornote" and "authornotemark" commands
%% used to denote shared contribution to the research.
\author{Lukas Steiner}
\email{lukas.steiner@rptu.de}
\orcid{0009-0009-3368-5396}
\affiliation{%
\institution{University of Kaiserslautern-Landau}
\city{Kaiserslautern}
\country{Germany}
}
\author{Thomas Psota}
\email{thomas.psota@iese.fraunhofer.de}
\orcid{0009-0009-3368-5396}
\affiliation{%
\institution{Fraunhofer IESE}
\city{Kaiserslautern}
\country{Germany}
}
\author{Marco Mörz}
\email{marco.moerz@iese.fraunhofer.de}
\orcid{}
\affiliation{%
\institution{Fraunhofer IESE}
\city{Kaiserslautern}
\country{Germany}
}
\author{Derek Christ}
\email{derek.christ@uni-wuerzburg.de}
\orcid{}
\affiliation{%
\institution{Julius-Maximilians-Universität}
\city{Würzburg}
\country{Germany}
}
\author{Matthias Jung}
\email{m.jung@uni-wuerzburg.de}
\orcid{0000-0003-0036-2143}
\affiliation{%
\institution{Julius-Maximilians-Universität}
\city{Würzburg}
\country{Germany}
}
\author{Norbert Wehn}
\email{norbert.wehn@rptu.de}
\orcid{0000-0002-9010-086X}
\affiliation{%
\institution{University of Kaiserslautern-Landau}
\city{Kaiserslautern}
\country{Germany}
}
%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
\todo{frequency vs. data rate}
As memory-intensive applications continue to drive demand for high DRAM bandwidth and capacity, accurately modeling the DRAM power consumption has become critical for optimizing system design and meeting power budgets.
Unfortunately, existing open-source DRAM power simulators only support older generations of DRAM standards, while current system designs mainly rely on the newest generation including DDR5, LPDDR5 or HBM3.
These standards support very high data rates beyond 10\,Gbps/pin.
This paper presents DRAMPower 5, a revised version of the popular DRAMPower simulator.
%DRAMPower 5 addresses this need by offering an open-source, detailed power analysis tool that now supports the latest JEDEC standards, including DDR5 and LPDDR5. This latest version introduces a refined interface power model, capturing specific interface dynamics that are increasingly relevant for emerging DRAM technologies, enhancing the accuracy of power estimations. Furthermore, DRAMPower 5 is designed with a flexible, modular architecture, enabling straightforward extensibility to support future DRAM standards and custom configurations. These features make DRAMPower 5 an essential tool for researchers and engineers focused on precise, scalable power analysis for current and next-generation DRAM systems.
\end{abstract}
%\paperlogo
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Body Text
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\input{content/01_intro}
%Intro
%Related
%DRAM Background: Short Intro of DRAM Interface and Core, single ended bidirection DQ, differential data strobe, data sampled when DQS\_t and DQS\_c cross -> double data rate
%
\section{Introduction}
%
The recent expansion of memory-intensive applications has led to increased demand for DRAM bandwidth and capacity in current computing systems.
This demand is particularly pronounced in \textit{Artificial Intelligence} (AI) applications, where specialized accelerator chips with immense DRAM bandwidths beyond 1\,TBps are used.
However, these bandwidths come at the cost of high power consumption.
In datacenters, memory often accounts for around 90\,\% of the system power~\cite{bou_24}.
Even in embedded augmented reality devices for the Metaverse, memory can account for more than 40\,\% of power consumption~\cite{yankao_24}.
Therefore, an accurate estimation of DRAM power consumption is critical in the early stages of design in order to properly dimension the power supply circuits and cooling.
In mobile devices, on the other hand, the overall power budget is constrained to only a few watts.
Nevertheless, it is equally important to accurately estimate DRAM power consumption, for example to explore the power saving potential of new DRAM standards and their additional features to extend battery life~\cite{borgho_18}.
In the current state of the art, there are two widely used open-source simulation tools for estimating DRAM power consumption, namely \textit{DRAMPower}~\cite{kargoo_14} and \textit{CACTI-IO}~\cite{joukah_12,joukah_15}.
DRAMPower focuses on the power consumption of the DRAM core, while CACTI-IO models the power consumption of the DRAM interface.
Unfortunately, neither tool has been updated in recent years, so both only provide support for older DRAM standards.
At the same time, current generation DRAM standards like DDR5, LPDDR5 and HBM3 operate at much higher data rates than their predecessors, use novel interconnection techniques, and offer many new features, which requires special consideration for power modeling.
To the best of our knowledge, there is no open-source DRAM power simulator that provides accurate models of both the DRAM core and interface, and supports current generation DRAM standards such as DDR5, LPDDR5 and HBM3.
To fill this gap, this paper presents DRAMPower 5, a completely revised version of the DRAMPower simulator, with a greatly enhanced feature set including both core and interface power modeling, an efficient simulation kernel, \todo{accuracy?} and support for the latest DRAM standards.
This paper makes the following new contributions:
\begin{itemize}
\item We present newly developed core and interface power models that are required to accurately capture current generation DRAM standards.
\item We explain how the inconsistent and incomplete operating current specifications provided in the DRAM standards need to be treated to model core power.
\item We show that at high operating frequencies, the approximations commonly used for interface power modeling result in large errors and a different modeling approach is required.
\item We present a new simulator architecture that can be easily extended by new standards or features and achieves high simulation speeds.
\item \todo{Accuracy simulations}
\item \todo{supported standards!!!}
\end{itemize}
The rest of the paper is structured as follows \todo{...}
%\input{content/02_related_works}
\section{Related Work}
In this section we provide an overview of the related work.
A well-known and often used DRAM power model is the System Power Calculator by Micron~\cite{micron_ddr3_11_kopie_ipsj}.
It is provided in the form of spreadsheets for various JEDEC standards including DDR/2/3/4 and LPDDR2/3/4/4X.
The power estimation is based on data sheet currents and timings for a specific DRAM device and workload specifications like the read-write-ratio or time that the DRAM is in each state.
However, this modeling approach can only achieve a limited accuracy because the actual command trace that is issued to the DRAM by the memory controller is not considered.
In addition, there exist no spreadsheets for current generation standards.
%However, this model is not accurate enough, as it assumes only certain workload characteristics and it is not looking on the actual executed application. There are further limitations in that model: Micron uses the minimal timing constrains from the datasheet specifications instead of the actual timings.
%But in practice there are dependencies between consecutive memory accesses so that the controller may accelerate or postpone commands. Furthermore, Micron assumes that the controller uses a close-page policy (precharge after each memory access) and that there is only one bank open at the same time. Due to this, a large lack of flexibility and accuracy exists in this model.
%
A more accurate simulation tool is DRAMPower~\cite{kargoo_14}, which also relies on data sheet values, but in addition uses a real DRAM command trace as input to model the internal state transitions with cycle accuracy.
However, the internal DRAM states are simplified and the power dependence on the number of active DRAM banks is not considered.
Thus, DRAMPower was enhanced with a bank-sensitive model in~\cite{junmat_16b,matzul_17} to improve its accuracy.
Still, the tool has two drawbacks: it only models core power, but no interface power, and it has not been updated to the latest standards yet.
Another simulation tool similar to DRAMPower is VAMPIRE~\cite{ghoyag_18}.
This tool focuses on the power variations between different DRAM modules, within one DRAM module depending on the access location, and the data value dependency.
VAMPIRE is calibrated with measurements of real DRAM modules and provides very accurate results.
However, this presupposes that real measurements are available for the devices to be used, which is not usually the case in the early stages of design. VAMPIRE also supports DDR3 only.
%
There exists an analytical DRAM core-power model by Vogelsang~\cite{vog_10}. This model reflects a DDR memory but is also used to extrapolate future memory power consumption behavior.
When it comes to DRAM interface power modeling, the most popular software is CACTI-IO~\cite{joukah_12,joukah_15}.
CACTI-IO does not rely on data sheet currents, but it uses an equivalent circuit diagram of the DRAM subsystem's real interface architecture as this architecture is not fixed for a specific device.
The power consumption is then calculated using a simplified network analysis.
While this approach leads to accurate results for older generation DRAM standards, the error introduced by the simplifications is significantly higher for current generation standards as they support much higher data rates.
In summary, there is no publicly available DRAM power simulation tool capable of modeling both core and interface power of current generation DRAM standards with high accuracy.
%\input{content/03_overview}
\section{DRAM Background}
%
This section provides the necessary background on the DRAM core and interface that is relevant for power modeling.
It also briefly introduces the different families of DRAM standards and explains their main differences.
%
\subsection{Core}
%
DRAM is a type of memory mainly optimized for a low cost per bit.
To achieve a high storage density, the chips are internally organized in a hierarchical fashion consisting of columns, rows, banks and, for newer standards, bank groups.
Before data can be read from or written to a column, the corresponding row must be activated.
Within each bank, only one row can be active at a time and the bank must be precharged before a new row can be activated.
Data is transferred over the interface in a burst fashion, i.e., for a read operation, a large amount of data is fetched internally in parallel from the array to the interface and then transferred to the memory controller in multiple beats (usually 8 or 16).
Information is stored as an electrical charge held in a tiny capacitor.
As the capacitor leaks this charge over time, each DRAM cell must be refreshed regularly (usually every 32 to 64\,ms).
The refresh operation is triggered externally by the memory controller with a refresh command.
During refresh, no data can be accessed within the target bank(s).
Thus, only a few rows are refreshed each time to avoid long access delays and a refresh command is sent every few microseconds.
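As an illustrative calculation (the values are typical numbers assumed here, not taken from a specific datasheet, and the symbol $t_{REFW}$ for the retention period is introduced only for this example): if all rows must be refreshed within $t_{REFW} = 64$\,ms and this work is distributed over 8192 refresh commands, the average spacing $t_{REFI}$ between two refresh commands is
\begin{equation*}
t_{REFI} = \frac{t_{REFW}}{8192} = \frac{64\,\mathrm{ms}}{8192} \approx 7.8\,\mu\mathrm{s}.
\end{equation*}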
To save energy, DRAM devices can be put into a power down mode when no data accesses are performed.
This disables parts of the core and interface.
However, in order to perform refreshes for data retention, it is necessary to interrupt the power down mode periodically.
To avoid this, the self refresh mode can be entered where data retention is managed by the device itself and no refresh commands need to be provided by the memory controller.
%
\subsection{Interface}
%
All modern DRAM subsystems use a bidirectional single-ended \textit{data bus} (DQ) to transfer data from the memory controller to the DRAM devices or the other way round.
To sample the data at the correct time, a differential \textit{data strobe} pair (DQS\_t/DQS\_c) is provided by the driving side.
Since data is sampled both at the rising and the falling edge of the data strobe (intersection of DQS\_t and DQS\_c), the bus operates at \textit{double data rate} (DDR).
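As a brief illustration of this relation (the symbols $f_{DQS}$ and $R_{pin}$ as well as the numerical values are chosen here only as an example): for a data strobe frequency $f_{DQS}$, the data rate per pin is
\begin{equation*}
R_{pin} = 2 \cdot f_{DQS},
\end{equation*}
so a strobe frequency of 1.6\,GHz corresponds to 3.2\,Gbps/pin, and a burst of 16 beats occupies $16/2 = 8$ strobe cycles, i.e., 5\,ns.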
Commands and addresses are transferred from the memory controller to the DRAM devices over a unidirectional \textit{command/address bus} (CA).
They are sampled on the edges of a differential clock signal pair (CK\_t/CK\_c) that is also driven by the memory controller.
Depending on the standard, the command/address bus is either operated at \textit{single data rate} (SDR) or double data rate.
In addition, the transfer of a single command can take one or multiple clock cycles depending on the standard because the width of the command/address bus varies.
Since all modern DRAM standards operate at frequencies in the gigahertz range with data rates reaching up to 10\,Gbps/pin, termination at the receiver side is required to ensure signal integrity.
%
To increase the memory capacity, multiple DRAM devices can be connected to the same memory controller, sharing the command/address as well as data bus (so-called ranks).
Each target device can be selected using a dedicated \textit{chip select} signal (CS).
The physical (inter)connection between memory controller and DRAM can be realized in different ways, e.g., through a classical printed circuit board (PCB), a package on package (PoP) arrangement or a silicon interposer.
All these channels have different characteristics in terms of load capacitances, reflections and loss, so they need to be modeled individually for an accurate power estimation.
One connection type widely used in PCs and servers is the dual inline memory module (DIMM).
Multiple DRAM chips are soldered onto a small PCB with pins on the bottom edge, which is then plugged into a socket on the main PCB.
DIMMs require special considerations for power modeling as there are different wiring topologies, off-die termination schemes and in some cases additional buffer chips for the command/address bus and data bus.
%
\subsection{DRAM Standards}
%
\todo{Special features see Luizas Master Thesis, e.g., DBI, write X, new refresh modes etc.}
Over the last quarter century, JEDEC has published more than 20 different DRAM standards.
As DRAM application fields become more heterogeneous, so do the standards.
Currently, there are four major families:
\begin{itemize}
\item \textit{DDR} is used as general-purpose memory for PCs and servers as it provides high capacities at a low cost. It can be organized as single devices or DIMMs.
\item \textit{Low-Power DDR (LPDDR)} is optimized for a low power consumption and mainly used in battery-powered devices like smartphones or embedded systems.
\item \textit{Graphics DDR (GDDR)} offers higher bandwidths than DDR and is mainly used in GPUs.
\item \textit{High Bandwidth Memory (HBM)} provides even higher bandwidths than GDDR by utilizing a much wider data bus and a silicon interposer for connection. It is mainly used in high-performance GPUs and ASICs.
\end{itemize}
%
While the DRAM core is structured similarly for all the standards, the interfacing between memory controller and DRAM differs greatly.
These differences reach from the pin count over the termination scheme and signaling voltage to the clocking architecture.
In addition, newer standards introduce further commands for optimized performance and power consumption.
\todo{In this first release of the revised version of DRAMPower, we focus on the DDR and LPDDR families, more specifically on DDR3/4/5 and LPDDR4/4X/5/5X.
GDDR and HBM will be added in a future release.}
Table~\ref{tab:standards} provides an overview of the most important features and characteristics of each standard.
Interface depends on controller and devices -> not fixed for one device
%
DDR3: push pull termination DQ
DDR4: pseudo open drain logic (PODL) DQ, DBI,
LPDDR4: low voltage swing termination logic (LVSTL)
HBM: active inductor, doubled current (Active-Inductive CTLE continuous-time linear equalizer)
Input/output capacitance is specified in standard
%\begin{table*}
% \centering
% \caption{Feature Overview of JEDEC Standards}
% \label{tab:standards}
% \begin{tabular}{c|c|c|c}
% Standard & Termination & Operating Voltages & Special Features\\
% DDR3 & SSTL & 1.5/1.5 & \\
% DDR4 & PODL & 2.5/1.2/1.2 & Data Bus Inversion\\
% DDR5 & PODL & 1.8/1.1/1.1 & Write Pattern, Command \& Address Inversion\\
% LPDDR4 & LVSTL & 1.8/1.1/1.1\\
% LPDDR4X & LVSTL & 1.8/1.1/0.6\\
% LPDDR5 & LVSTL & 1.8/1.05/0.5 & Write Clock\\
% \end{tabular}
%\end{table*}
DIMM Features:
DDR3 UDIMM: Fly-By topology except for DQ/DM and DQS, Push-Pull Termination CK\_t, CK\_c, CTRL (S0\_n, ODT0, CKE0), CMD, Series resistor Rs for DQ/DM and DQS\_t/DQS\_c, other DIMMs: LRDIMM, RDIMM, SODIMM...
DDR4 UDIMM: Fly-By topology except for DQ/DM and DQS, Push-Pull Termination ...
DDR5 UDIMM: Fly-By , other DIMMs: LRDIMM, RDIMM, SODIMM, CUDIMM (clocked unbuffered)...
%
%
\section{Core Power Modeling}
%
This section explains the modeling of core power, while the modeling of interface power is covered in the next section.
Core and interface can be considered completely independent of each other because they always use different supply voltages.
Core power refers to the power consumed by the internal circuitry of the DRAM device, i.e., the memory arrays, sense amplifiers, row and column decoders, I/O gating and control logic.
The receiver circuits at the interface are also operated with the core supply voltage and are therefore included in the core power.
As the internal architecture of modern DRAM devices is very complex and highly proprietary, core power calculation cannot be based on network analysis.
However, each DRAM standard defines a set of currents for fixed operating scenarios, which are listed in vendor datasheets.
Based on these currents, the core power can be estimated.
%%%%
%%%%
The following subsection provides an overview of these currents.
%
\subsection{Current Measurement Conditions}\label{subsec:current_measurement}
%
The minimum set specified in all DRAM standards includes the following nine currents:
%
\begin{itemize}
\item $I_{DD0}$ (Operating one bank active-precharge current): Activate and precharge commands are sent alternately with minimum spacing. The target bank is toggled with each activate command.
\item $I_{DD2N}$ (Precharge standby current): All banks are precharged and no commands are issued.
\item $I_{DD2P}$ (Precharge power-down current): All banks are precharged, no commands are issued and the device is in power-down mode.
\item $I_{DD3N}$ (Active standby current): All banks are active and no commands are issued.
\item $I_{DD3P}$ (Active power-down current): All banks are active, no commands are issued and the device is in power-down mode.
\item $I_{DD4R}$ (Operating burst read current): All banks are active and read commands are issued with minimum spacing. The target bank is toggled with each read command.
\item $I_{DD4W}$ (Operating burst write current): All banks are active and write commands are issued with minimum spacing. The target bank is toggled with each write command.
\item $I_{DD5B}$ (Burst refresh current): Refresh commands are issued with minimum spacing.
\item $I_{DD6}$ (Self refresh current): The device is in self-refresh operation and the external clock is turned off.
\end{itemize}
%
Unfortunately, the different JEDEC subcommittees, which are responsible for formulating DRAM standards, are very inconsistent in specifying the currents.
Apart from different naming schemes\footnote{To avoid confusion, we use our own naming scheme, which is a mixture of several standards.}, the measurement conditions mentioned above only apply for standards of the DDR family, while they differ for LPDDR, GDDR and HBM.
For example, LPDDR measures $I_{DD3N}$, $I_{DD3P}$, $I_{DD4R}$ and $I_{DD4W}$ with only one bank active.
GDDR measures $I_{DD3N}$ and $I_{DD3P}$ with one bank active, while $I_{DD4R}$ and $I_{DD4W}$ are measured with one bank in each bank group active.
HBM, in turn, measures $I_{DD3N}$ and $I_{DD3P}$ with one bank active and $I_{DD4R}$ as well as $I_{DD4W}$ with all banks active.
Section~\ref{subsec:bankwise} explains how these different measurement conditions are treated to achieve a universal bank-sensitive power model.
Similarly, the refresh currents are also measured under various conditions.
While DDR standards specify a burst refresh current $I_{DD5B}$ for all available refresh modes, LPDDR standards specify a burst refresh current only for all-bank refresh, while for per-bank refresh, an average current $I_{DD5A}$ is provided.
The difference between $I_{DD5B}$ and $I_{DD5A}$ is the spacing between two consecutive refresh commands.
It is the refresh cycle time $t_{RFC}$ (i.e., the duration of a single refresh operation) for $I_{DD5B}$ and the much longer average refresh interval $t_{REFI}$ (i.e., the interval at which refresh commands need to be issued in normal operation) for $I_{DD5A}$.
%GDDR5/5X/6 and HBM1/2 do not specify a current for per-bank refresh at all although they support it.
Section~\ref{subsec:refresh} shows how refresh power can be modeled using the provided currents of each standard.
Even if all missing currents are calculated, the approach used for core power calculation still faces two problems, which have been highlighted in~\cite{ghoyag_18}.
First, the device-to-device variations are very large, which forces the vendors to be very pessimistic when specifying operating currents.
As a consequence, power is overestimated in most cases.
Second, the currents are measured for fixed data and address patterns, i.e., no data dependencies and structural variations within the device are considered.
If a more accurate modeling is required, the calculations have to be refined with additional device measurements.
\todo{last subsection? extra features, maybe future work?}
\todo{multiple supply voltages!}
\subsection{Universal Bank-Sensitive Model}\label{subsec:bankwise}
%
The DRAM core power is composed of background power and command power.
A bank-sensitive model is used for the background power, i.e., the more banks are active, the higher the power consumption.
This model was already introduced in previous versions of the tool~\cite{junmat_16b,matzul_17} and provides higher accuracy compared to a model that only distinguishes between two states (either active or precharged) like the Micron system power calculator~\cite{micron_ddr3_11_kopie_ipsj}.
As shown in Figure~\ref{fig:bank_sensitive_currents} for a DRAM of the DDR family with $B$ banks, $I_{DD2N}$ is drawn when all banks are precharged and $I_{DD3N}$ is drawn when all $B$ banks are active.
The span in between is not divided linearly depending on the number of active banks, but there is an offset when activating the first bank.
This is due to the fact that additional logic must be switched on when the first bank is activated.
%
\begin{figure}
\centering
\input{img/bankwise_current}
\caption{Bank-Sensitive Currents~\cite{junmat_16b}}
\label{fig:bank_sensitive_currents}
\end{figure}
%
$\rho$ is a vendor- and device-specific factor between 0 and 1, which can be determined by measurement~\cite{junmat_16b}.
Alternatively, the pessimistic assumption of $\rho = 1$ can be made, which leads to the simplified model with only two distinct states as used by Micron~\cite{micron_ddr3_11_kopie_ipsj}.
For standards of the DDR family, $I_{DD3N} = I_{\circled{B}}$ holds, while for LPDDR, GDDR and HBM, $I_{DD3N} = I_{\circled{1}}$.
This difference must be taken into account when calculating the background power.
If the current $I_{DD2N}$, the factor $\rho$, a current $I_{DD3N}$ measured with $N$ banks active, and the total number of banks $B$ are given, all other currents can be calculated.
For $1 \le N \le B$ active banks, the current is
\begin{equation}
I_{\circled{N}} = I_{DD2N} + (I_{\circled{B}} - I_{DD2N}) \cdot \left(\rho + (1-\rho)\cdot \frac{N}{B}\right)
\end{equation}
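As a sketch of how this equation can be applied when $I_{DD3N}$ is measured with fewer than $B$ banks active, it can be solved for $I_{\circled{B}}$; the numerical values below are hypothetical and only serve as an illustration.
With $I_{DD3N} = I_{\circled{N}}$, rearranging gives
\begin{equation*}
I_{\circled{B}} = I_{DD2N} + \frac{I_{DD3N} - I_{DD2N}}{\rho + (1-\rho)\cdot \frac{N}{B}}.
\end{equation*}
For example, assuming $B = 16$, $\rho = 0.5$, $I_{DD2N} = 40$\,mA and $I_{DD3N} = I_{\circled{1}} = 57$\,mA, the denominator is $0.5 + 0.5 \cdot \frac{1}{16} = 0.53125$, which yields $I_{\circled{B}} = 40\,\mathrm{mA} + 17\,\mathrm{mA} / 0.53125 = 72$\,mA.
Inserting this value into the equation above for four active banks gives
\begin{equation*}
I_{\circled{4}} = 40\,\mathrm{mA} + (72\,\mathrm{mA} - 40\,\mathrm{mA}) \cdot \left(0.5 + 0.5 \cdot \frac{4}{16}\right) = 60\,\mathrm{mA}.
\end{equation*}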
When the DRAM is in power-down mode, the dependence of the current on the number of active banks is much smaller, so we only distinguish between two states characterized by $I_{DD2P}$ and $I_{DD3P}$.
The average command power is calculated by counting the number of commands of each type, adding up the energy that is consumed for all these commands, and dividing the total energy by the simulated time.
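This description can be written compactly as follows, where the symbols $P_{cmd}$, $n_c$, $E_c$ and $T_{sim}$ are notation introduced here for illustration: with $n_c$ being the number of issued commands of type $c$, $E_c$ the energy of a single command of that type, and $T_{sim}$ the simulated time,
\begin{equation*}
P_{cmd} = \frac{1}{T_{sim}} \sum_{c} n_c \cdot E_c.
\end{equation*}
As a hypothetical example, 1000 read commands with 1.5\,nJ each and 500 write commands with 1.8\,nJ each within $T_{sim} = 100\,\mu\mathrm{s}$ result in $(1500\,\mathrm{nJ} + 900\,\mathrm{nJ}) / 100\,\mu\mathrm{s} = 24$\,mW.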
As for the background power, the differences among the standards must be taken into consideration for the command power as well.
In~\cite{junmat_16b}, the energy for a read command $E_{RD}$ is calculated as
\begin{equation}
E_{RD} = V_{DD} \cdot (I_{DD4R} - I_{DD3N}) \cdot \frac{BL}{DR} \cdot t_{CK}
\end{equation}
where $V_{DD}$ is the core supply voltage, $BL$ is the burst length, $DR$ is the data rate and $t_{CK}$ is the clock period.
For a write command, $I_{DD4R}$ is replaced with $I_{DD4W}$.
However, this equation only works if $I_{DD4R}$ and $I_{DD3N}$ are measured with the same number of banks active, which is not the case for GDDR and HBM.
Thus, the equations need to be adapted accordingly, i.e., for GDDR, $I_{DD3N}$ must be replaced with $I_{\circled{BG}}$ with $BG$ being the number of bank groups, while for HBM, $I_{DD3N}$ must be replaced with $I_{\circled{B}}$.
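Written out, the adaptations described above take the following form; the superscripts are used here only to distinguish the two cases:
\begin{equation*}
E_{RD}^{GDDR} = V_{DD} \cdot \left(I_{DD4R} - I_{\circled{BG}}\right) \cdot \frac{BL}{DR} \cdot t_{CK}, \qquad
E_{RD}^{HBM} = V_{DD} \cdot \left(I_{DD4R} - I_{\circled{B}}\right) \cdot \frac{BL}{DR} \cdot t_{CK}.
\end{equation*}
As a numerical illustration of the unmodified equation with hypothetical values ($V_{DD} = 1.1$\,V, $I_{DD4R} = 200$\,mA, $I_{DD3N} = 60$\,mA, $BL = 16$, $DR = 2$, $t_{CK} = 1.25$\,ns), the energy of a single read command is
\begin{equation*}
E_{RD} = 1.1\,\mathrm{V} \cdot 140\,\mathrm{mA} \cdot \frac{16}{2} \cdot 1.25\,\mathrm{ns} = 1.54\,\mathrm{nJ}.
\end{equation*}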
%
%
%\subsection{Current Measurement Conditions}
%%
%All JEDEC standards specify multiple currents for different operating scenarios.
%Using the provided values of vendor datasheets, the power consumption of the devices can be estimated.
%However, JEDEC is very inconsistent with the current measurement conditions, meaning that the same current can describe different operating scenarios depending on the standard.
%For example, the active standby current IDD3N is measured with all banks active for DDR standards, while it is measured with only a single bank active for LPDDR standards.
%Similarly, for IDD4R and IDD4W, all banks are active for DDR, while only a single bank is active for LPDDR.
%This has to be taken into account...
%
%
%
%All modern JEDEC standards specify at least three supply voltages.
%Most standards denote them as VDD, VPP and VDDQ, while LPDDR standards use VDD1, VDD2 and VDDQ.
%VDD/VDD1: main supply voltage
%VPP/VDD2: pump voltage/wordline boost/activation
%VDDQ: supply voltage for output drivers (DQ, DQS\_t, DQS\_c)
%
%The currents specified by the JEDEC standards are provided separately for each supply voltage in the datasheets.
%Most important currents exemplarily for VDD specified in table
%\begin{table}
%\caption{}
%\label{tab:configs}
%\centering
%\resizebox{\linewidth}{!}{%
%\input{img/currents_table}
%}
%\end{table}
%, which is responsible for performing the core memory operations.
%, i.e., storing data in the memory array that was received on the interface, transferring like reading, writing, refreshing, and storing data.
%%
%Components:
%Memory array operations: Power consumed during data access, including charging and discharging the DRAM cells when data is read from or written to them.
%Refresh operations: DRAM cells need to be refreshed periodically to retain their data, and this refresh process consumes core power.
%Row and column decoders: These circuits select which rows and columns in the memory array are accessed during a read or write operation.
%Sense amplifiers: These are used to detect the small charge differences in the memory cells when reading data, and they also contribute to core power consumption.
%Key Characteristics:
%Dependent on the internal operations and how frequently the memory is accessed or refreshed.
%Affected by DRAM timing parameters (like tRCD, tRAS, etc.) and the DRAM's operational state (active vs idle).
%
%Core power is hard to calculate based on physical equations/properties because the DRAM vendors do not publish any information about the internal structure of the DRAM devices.
%However, the vendors publish data sheets of their devices with measured currents for specific operations (see Section current measurement conditions).
%Although these measurements are very pessimistic, they can be used as a rough estimate for core power calculations, can also be refined with own measurements of devices.
%
\subsection{Refresh Power}\label{subsec:refresh}
%
Depending on the DRAM standard, various refresh modes are offered.
They differ in the number of banks that are refreshed with a single command.
All-bank refresh commands target all banks of the device at once.
As no data can be accessed in banks where a refresh is in progress, this mode can cause a large drop in bandwidth.
Thus, newer DRAM standards offer improved refresh modes where only a single bank (per-bank refresh), two banks (per-2-bank refresh) or one bank in each bank group (same-bank refresh) of the device are targeted with a single command, while the remaining banks can still be accessed in the meantime.
The duration of a single refresh command is the refresh cycle time $t_{RFC}$, which is also the spacing of refresh commands when measuring the burst refresh current $I_{DD5B}$.
Thus, when a burst refresh current is provided, the energy for a single refresh command $E_{REF}$ can be calculated as
\begin{equation}
E_{REF} = V_{DD} \cdot \left(I_{DD5B} - I_{\circled{N}}\right) \cdot t_{RFC}
\end{equation}
where $N$ is the number of refreshed banks.
As the equation shows, banks with a refresh in progress are considered active, which is the most accurate way of modeling because, internally, the refresh is performed by successively activating multiple rows within each target bank.
In cases where only an average refresh current $I_{DD5A}$ is provided, an approximate value for $I_{DD5B}$ can be determined.
Figure~\ref{fig:refresh_currents} illustrates the relation between the two refresh currents, where the dashed boxes represent the energy that is consumed.
The voltage is neglected because it is constant.
%
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\input{img/refresh_currents}
}
\caption{Relation between Refresh Currents}
\label{fig:refresh_currents}
\end{figure}
%
From the definitions of the two currents, we know that the energy consumed within one refresh interval $t_{REFI}$ is the same whether it is expressed via the burst refresh current or via the average refresh current.
This relationship can be translated into the following equation to calculate $I_{DD5B}$ from $I_{DD5A}$:
\begin{equation}
I_{DD5B} = I_{DD2N} + \left(I_{DD5A} - I_{DD2N}\right) \cdot \frac{t_{REFI}}{t_{RFC}}
\end{equation}
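The relation can be derived by equating the charge drawn within one refresh interval under both descriptions.
The following sketch assumes, in line with the use of $I_{DD2N}$ in the equation above, that the device draws the precharge standby current $I_{DD2N}$ for the remainder of the interval once the refresh has completed:
\begin{equation}
	I_{DD5A} \cdot t_{REFI} = I_{DD5B} \cdot t_{RFC} + I_{DD2N} \cdot \left(t_{REFI} - t_{RFC}\right).
\end{equation}
Solving for $I_{DD5B}$ yields the equation above.
As a purely illustrative example with assumed values (not taken from a datasheet) of $I_{DD2N} = \SI{50}{\milli\ampere}$, $I_{DD5A} = \SI{60}{\milli\ampere}$, $t_{REFI} = \SI{3.9}{\micro\second}$ and $t_{RFC} = \SI{295}{\nano\second}$, the burst refresh current evaluates to
\begin{equation}
	I_{DD5B} = \SI{50}{\milli\ampere} + \SI{10}{\milli\ampere} \cdot \frac{\SI{3900}{\nano\second}}{\SI{295}{\nano\second}} \approx \SI{182}{\milli\ampere}.
\end{equation}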
%
%The equation can be used to calculate the burst refresh current of different refresh modes by substituting the average refresh current $I_{DD5A}$, refresh interval $t_{REFI}$ and refresh cycle time $t_{RFC}$ with the appropriate values.
%During refresh, the device is considered in active state because internally the banks are constantly activated and refreshed.
%The energy for an all-bank refresh command can be calculated as
%\begin{equation}
% E_{REFab} = V_{DD} \cdot \left(I_{DD5Bab} - I_{DD3N}\right) \cdot t_{RFCab}
%\end{equation}
%When per-bank refresh is used, only a single bank is refreshed at a time.
%Thus, only a single bank is considered active and the equation changes to
%\begin{equation}
% E_{REFpb} = V_{DD} \cdot \left(I_{DD5Bpb} - I_{\circled{1}}\right) \cdot t_{RFCpb}
%\end{equation}
%%
%Same-bank refresh for device with \textit{BG} bank groups and \textit{BA} banks per bank group
%\begin{equation}
% E_{REFpb} = V_{DD} \cdot \left(I_{DD5Bsb} - I_{\circled{BG}}\right) \cdot t_{RFCsb}
%\end{equation}
%
%
\section{Interface Power Modeling}
%
Interface power refers to the power consumed by the drivers for the communication between memory controller and DRAM devices.
In contrast to the core power, which is fixed for a specific device, the interface power depends on the complete DRAM subsystem architecture, i.e., the physical layer (PHY) of the memory controller, the channel architecture (number of ranks, possible usage of DIMMs, etc.), the channel characteristics (e.g., channel loss and parasitic capacitances) and the DRAM PHYs.
Thus, modeling based on the operating currents specified in vendor datasheets is not possible, as they are measured for only one specific subsystem architecture.
Instead, we calculate the interface power based on an equivalent circuit diagram of the real interface architecture as is also done by CACTI-IO.
Interface power can be divided into \textit{termination power}, which is dissipated across the termination resistances required for signal integrity, and \textit{dynamic power}, which is dissipated through the lossy charging and discharging of parasitic capacitances and the signaling over a lossy transmission line.
In the following two sections, the calculation of termination power and dynamic power is explained.
%
\subsection{Termination Power}
\todo{also counting commands as for core power}
%
%%%%
%\begin{figure}
% \centering
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% %\ctikzset{bipoles/tline/width=0.6}
% \draw (0,0) node[ieeestd buffer port, anchor=in](driver){};
% \draw (driver.in) to[short, -o] ++(0,0);
% \draw (driver.up) -- ++(0,1) node[tground](VDDQ){$V_{DDQ}$};
% \draw (driver.down) -- ++(0,-1) node[tlground](VSSQ){$V_{SSQ}$};
% \draw (driver.bout) to[tline=$Z_0$, bipoles/tline/width=1, -*] ++(3,0) coordinate(TL);
% \draw (TL) to[R, a=$2R_{TT}$] (TL|-VDDQ) node[tground]{$V_{DDQ}$};
% \draw (TL) to[R=$2R_{TT}$] (TL|-VSSQ) node[tlground]{};
% \end{circuitikz}
% \caption{Low Voltage Swing Terminated Logic (LVSTL) Interface}
% \label{fig:term_push_pull}
%\end{figure}
%%%%%
%\begin{figure}
% \centering
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0) node[tground](VDDQ){} to[R=$R_{ON}$] ++(0,-1.5) to[short=$"1"$] ++(3,0) to[R=$R_{TT}$] ++(0,-1.5) node[tground](VDDQ2){};
% \node[anchor=south] at (VDDQ) {$V_{DDQ}$};
% \node[anchor=north] at (VDDQ2) {$V_{DDQ}/2$};
% \end{circuitikz}
% \caption{Driving Logic 1 on Push Pull Termination Interface}
% \label{fig:term_push_pull}
%\end{figure}
%%
%\begin{figure}
% \centering
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0) node[tlground]{} to[R,a=$R_{ON}$] ++(0,1.5) to[short=$"0"$] ++(3,0) to[R,a=$R_{TT}$] ++(0,1.5) node[tground](VDDQ2){};
% \node[anchor=south] at (VDDQ2) {$V_{DDQ}/2$};
% \end{circuitikz}
% \caption{Driving Logic 0 on Push Pull Termination Interface}
% \label{fig:term_push_pull}
%\end{figure}
%%
%\begin{figure}
% \centering
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0) node[tlground]{} to[R,a=$R_{ON}$] ++(0,1.5) to[short=$"0"$] ++(3,0) to[R,a=$R_{TT}$] ++(0,1.5) node[tground](VDDQ){};
% \node[anchor=south] at (VDDQ) {$V_{DDQ}$};
% \end{circuitikz}
% \caption{Driving Logic 0 on POD Interface}
% \label{fig:term_push_pull}
%\end{figure}
%%
%\begin{figure}
% \centering
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0) node[tground](VDDQ){} to[R=$R_{ON}$] ++(0,-1.5) to[short=$"1"$] ++(3,0) to[R=$R_{TT}$] ++(0,-1.5) node[tlground](VSSQ){};
% \node[anchor=south] at (VDDQ) {$V_{DDQ}$};
% \end{circuitikz}
% \caption{Driving Logic 1 on LVSTL Interface}
% \label{fig:term_push_pull}
%\end{figure}
%
The termination power depends on the termination scheme, the ratio between transmitted logic zeros and ones, and the time during which the termination is enabled, but it is independent of the operating frequency.
There are three commonly used termination schemes for DRAM, shown in Figure~\ref{fig:term} for a simple point-to-point connection.
%
\begin{figure*}
\centering
\begin{subfigure}[b]{0.32\linewidth}
\centering
\resizebox{\linewidth}{!}{%
\begin{circuitikz}
\ctikzset{bipoles/resistor/height=0.15}
\ctikzset{bipoles/resistor/width=0.4}
%\ctikzset{bipoles/tline/width=0.6}
\draw (0,0) node[pmos, emptycircle, anchor=D](P){};
\draw (0,0) node[nmos, anchor=D](N){};
\draw (P.S) -- ++(0,0) node[tground](VDDQ){};
\node[anchor=south] at (VDDQ) {$V_{DDQ}$};
\draw (N.S) -- ++(0,0) node[tlground](VSSQ){};
\draw (P.G) -- (N.G);
\draw (P.south west) to[short, *-o] ++(-0.5,0);
\draw (P.D) to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) coordinate(TL);
\draw (TL) to[R, a=$R_{TT}$] (TL|-VDDQ) node[tground](VDDQ1){};
\node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
\draw (TL) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){};
\draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0);
\draw (recv.bout) to[short, -o] ++(0.5,0);
\end{circuitikz}%
}
\caption{Pseudo Open Drain Logic (PODL)}
\label{fig:term_podl}
\end{subfigure}
%
\begin{subfigure}[b]{0.32\linewidth}
\centering
\resizebox{\linewidth}{!}{%
\begin{circuitikz}
\ctikzset{bipoles/resistor/height=0.15}
\ctikzset{bipoles/resistor/width=0.4}
%\ctikzset{bipoles/tline/width=0.6}
\draw (0,0) node[pmos, emptycircle, anchor=D](P){};
\draw (0,0) node[nmos, anchor=D](N){};
\draw (P.S) -- ++(0,0) node[tground](VDDQ){};
\node[anchor=south] at (VDDQ) {$V_{DDQ}$};
\draw (N.S) -- ++(0,0) node[tlground](VSSQ){};
\draw (P.G) -- (N.G);
\draw (P.south west) to[short, *-o] ++(-0.5,0);
\draw (P.D) to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) coordinate(TL);
\draw (TL) to[R=$R_{TT}$] (TL|-VSSQ) node[tlground]{};
\draw (TL) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){};
\draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0);
\draw (recv.bout) to[short, -o] ++(0.5,0);
\end{circuitikz}%
}
\caption{Low Voltage Swing Term. Logic (LVSTL)}
\label{fig:term_lvstl}
\end{subfigure}
%
\begin{subfigure}[b]{0.32\linewidth}
\centering
\resizebox{\linewidth}{!}{%
\begin{circuitikz}
\ctikzset{bipoles/resistor/height=0.15}
\ctikzset{bipoles/resistor/width=0.4}
%\ctikzset{bipoles/tline/width=0.6}
\draw (0,0) node[pmos, emptycircle, anchor=D](P){};
\draw (0,0) node[nmos, anchor=D](N){};
\draw (P.S) -- ++(0,0) node[tground](VDDQ){};
\node[anchor=south] at (VDDQ) {$V_{DDQ}$};
\draw (N.S) -- ++(0,0) node[tlground](VSSQ){};
\draw (P.G) -- (N.G);
\draw (P.south west) to[short, *-o] ++(-0.5,0);
\draw (P.D) to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) coordinate(TL);
\draw (TL) to[R, a=$2R_{TT}$] (TL|-VDDQ) node[tground](VDDQ1){};
\node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
\draw (TL) to[R=$2R_{TT}$] (TL|-VSSQ) node[tlground]{};
\draw (TL) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){};
\draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0);
\draw (recv.bout) to[short, -o] ++(0.5,0);
\end{circuitikz}%
}
\caption{Stub Series Terminated Logic (SSTL)}
\label{fig:term_sstl}
\end{subfigure}
%
\caption[DRAM Interface Termination Schemes]{DRAM Interface Termination Schemes\footnotemark}
\label{fig:term}
\end{figure*}
\footnotetext{The pull-up driver can be implemented with either PMOS or NMOS transistors.}
%
\textit{Pseudo open drain logic} (PODL) and \textit{low voltage swing terminated logic} (LVSTL) only use a pull-up or a pull-down resistor, respectively.
In contrast, \textit{stub series terminated logic} (SSTL) uses both a pull-up and a pull-down resistor.
In all three cases, the termination resistance is matched to the characteristic impedance of the transmission line, i.e., $R_{TT} \approx Z_0$ (remember that in AC analysis a DC voltage source is treated as a short).
To calculate the power, both logic levels are considered separately.
The transistor of the driver that is switched on is replaced with an equivalent resistor with resistance $R_{ON}$, while the transistor that is switched off is replaced with an open line.
As an example, Figure~\ref{fig:terminations} shows the two equivalent circuit diagrams for a PODL interface.
%
\begin{figure}
\centering
\begin{circuitikz}
\ctikzset{bipoles/resistor/height=0.15}
\ctikzset{bipoles/resistor/width=0.4}
\draw (0,0)
node[tground](VDDQ1){}
to [R=$R_{ON}$] ++(0,-1.5) coordinate(x1)
to [short=$"1"$, name={s1}] ++(2,0) coordinate(x2)
to [R,a=$R_{TT}$] ++(0,1.5) node[tground](VDDQ2){};
\node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
\node[anchor=south] at (VDDQ2) {$V_{DDQ}$};
\draw(x2) to [open] ++(1.75,0) coordinate(x3)
to [R=$R_{ON}$] ++(0,-1.0) node[ground](x4){};
\draw(x3)
to [short=$"0"$, name={s2}] ++(2,0)
to [R,a=$R_{TT}$] ++(0,1.5) node[tground](VDDQ3){};
\node[anchor=south] at (VDDQ3) {$V_{DDQ}$};
\path(x4) ++(0,-1.0) coordinate(x5);
\draw(s1|-x5) node[](){\bfseries (a) Driving Logic "1"};
\draw(s2|-x5) node[](){\bfseries (b) Driving Logic "0"};
\end{circuitikz}%
%\begin{subfigure}[t]{0.49\linewidth}
%\centering
%\resizebox{\linewidth}{!}{%
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0)
% node[tground](VDDQ1){}
% to [R=$R_{ON}$] ++(0,-1.5) coordinate(foo)
% to [short=$"1"$] ++(3,0)
% to [R,l=$R_{TT}$] ++(0,1.5) node[tground](VDDQ2){};
% \node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
% \node[anchor=south] at (VDDQ2) {$V_{DDQ}$};
% \draw[white](foo) to [R] ++(0,-1.0) node[ground](VDDQ){};
% \end{circuitikz}}
%\caption{Driving Logic "1"}
%\label{fig:term_logic_1}
%\end{subfigure}
%%
%\begin{subfigure}[t]{0.49\linewidth}
%\centering
%\resizebox{\linewidth}{!}{%
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0) node[tlground]{} to[R,a=$R_{ON}$] ++(0,1.5) to[short=$"0"$] ++(3,0) to[R,l=$R_{TT}$] ++(0,1.5) node[tground](VDDQ){};
% \node[anchor=south] at (VDDQ) {$V_{DDQ}$};
% \end{circuitikz}}
%\caption{Driving Logic "0"}
%\label{fig:term_logic_0}
%\end{subfigure}
\caption{Equivalent Circuit Diagrams for PODL Termination Power}
\label{fig:terminations}
\end{figure}
%
When driving a logic "1", both ends of the circuit are connected to $V_{DDQ}$, which means that no current is flowing and no power is dissipated, i.e.,
\begin{equation}
P_{term,1}^{PODL} = 0.
\end{equation}
In contrast, when driving a logic "0", one side is connected to ground, while the other side is connected to $V_{DDQ}$.
The dissipated power is calculated as
\begin{equation}
P_{term,0}^{PODL} = \frac{V_{DDQ}^2}{R_{ON} + R_{TT}}.
\end{equation}
In the case of an LVSTL interface, the equations for both logic levels are reversed. %, i.e.,
%\begin{equation}
%P_{term,0}^{LVSTL} = 0
%\end{equation}
%and
%\begin{equation}
%P_{term,1}^{LVSTL} = \frac{V_{DDQ}^2}{R_{ON} + R_{TT}}.
%\end{equation}
The SSTL interface uses both a pull-up and a pull-down resistor; therefore, power is dissipated at both logic levels.
It can be calculated as
\begin{equation}
P_{term,0}^{SSTL} = P_{term,1}^{SSTL} = \frac{V_{DDQ}^2}{(R_{ON}||2R_{TT})+2R_{TT}}.
\end{equation}
%
The average termination power for transmitting $n_0$ logic zeros and $n_1$ logic ones is
\begin{equation}
P_{term} = \frac{P_{term,0} \cdot n_0 + P_{term,1} \cdot n_1}{n_0 + n_1}
\end{equation}
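As a brief illustration under assumed values (chosen for round numbers, not taken from a specific device), consider a PODL interface with $V_{DDQ} = \SI{1.2}{\volt}$, $R_{ON} = \SI{40}{\ohm}$ and $R_{TT} = \SI{60}{\ohm}$.
The two per-level powers are then
\begin{equation}
	P_{term,0}^{PODL} = \frac{(\SI{1.2}{\volt})^2}{\SI{40}{\ohm} + \SI{60}{\ohm}} = \SI{14.4}{\milli\watt}, \qquad P_{term,1}^{PODL} = 0.
\end{equation}
For a balanced pattern with $n_0 = n_1$, the average termination power is \SI{7.2}{\milli\watt}, whereas a hypothetical pattern with $n_0 = 2$ and $n_1 = 6$ yields
\begin{equation}
	P_{term} = \frac{\SI{14.4}{\milli\watt} \cdot 2 + 0 \cdot 6}{2 + 6} = \SI{3.6}{\milli\watt}.
\end{equation}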
%dissipated termination energy depends on the bit period $t_b$ (minimum time that signal is at one logic level) and the number of transmitted logic zeros $n_0$ and logic ones $n_1$.
%It is calculated as
%\begin{equation}
% E_{term} = (P_{term,0} \cdot n_0 + P_{term,1} \cdot n_1) \cdot t_b.
%\end{equation}
Because with PODL and LVSTL only one logic level consumes power, data bus inversion can be used to reduce the termination power consumption.
With SSTL, the termination power is independent of the transmitted data.
When using channel configurations with multiple ranks or DIMMs, the interconnect network can change from a simple point-to-point topology to a more complex topology, e.g., because the non-target dies also terminate the bus.
In these cases, termination power can be calculated in the same way by determining the equivalent circuit diagrams for both logic levels.
%
%Figures show simplest networks consisting of driver with pull-up and pull-down on resistance $R_{ON}$, lossless transmission line with impedance $Z_0$ and termination resistance $R_{TT}$.
%$R_{TT}$ is chosen to match $Z_0$ and typically has a value of $50 \Omega$.
%For push pull termination, static termination power is consumed both when logic 1 and 0 is transmitted and it is calculated as
%\begin{equation}
% P_{term} = \frac{V_{DDQ}^2}{4 \cdot (R_{ON} + R_{TT})}.
%\end{equation}
%In contrast, for POD static termination power is only consumed when logic 0 is transmitted and for LVSTL static termination power is consumed when logic 1 is transmitted.
%This has the advantage that idle bus can be "parked" at either 0 or 1 and no static power is consumed.
%It is calculated as
%\begin{equation}
% P_{term} = \frac{V_{DDQ}^2}{R_{ON} + R_{TT}}
%\end{equation}
%
%Core power is fixed for specific device -> calculation based on currents specified in datasheets
%Interface power depends on Controller PHY, PCB, socket etc. -> calculation based on physical laws
%Since supply voltage of drivers is separated from supply voltage of core, calculation can be split up like this
%% kurz: was ist DRAM
%%% wie spielen verschiedene Standards da rein
%%% unterschiede in specs bei versch. herstellern / modellen
%%% instruktionen u. korrelation zu energie verbrauch
%
%\input{content/04_drampower}
%
\subsection{Dynamic Power}
%
As shown in the previous section, termination power is frequency independent because it is dissipated across a purely resistive network.
Termination power represents a lower bound for the total power consumption and also dominates at low operating frequencies.
However, since current generation DRAM standards support data rates of 10\,Gbps/pin and more, the impact of parasitic capacitances is much more significant.
Figure~\ref{fig:load_caps} shows the simple point-to-point connection with the PODL termination scheme already presented in Figure~\ref{fig:term_podl}, but with two added parasitic capacitances, one at the driver side and one at the receiver side.
%
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\begin{circuitikz}
\ctikzset{bipoles/resistor/height=0.15}
\ctikzset{bipoles/resistor/width=0.4}
%\ctikzset{bipoles/tline/width=0.6}
\draw (0,0) node[pmos, emptycircle, anchor=D](P){};
\draw (0,0) node[nmos, anchor=D](N){};
\draw (P.S) -- ++(0,0) node[tground](VDDQ){};
\node[anchor=south] at (VDDQ) {$V_{DDQ}$};
\draw (N.S) -- ++(0,0) node[tlground](VSSQ){};
\draw (P.G) -- (N.G);
\draw (P.south west) to[short, *-o] ++(-0.5,0);
\draw (P.D) to[short,*-*] ++(0.8,0) coordinate(D1) to[tline=$Z_0$, bipoles/tline/width=1, *-*] ++(3,0) coordinate(D3) to[short,-*] ++(0.8,0) coordinate(D4);
\draw (D4) to[R,a=$R_{TT}$] (D4|-VDDQ) node[tground](VDDQ1){};
\node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
\draw (D4) -- ++(1.5,0) node[plain amp, anchor=in up, scale=0.6](recv){};
\draw (recv.bin down) to[short=$V_{ref}$, -o] ++(-0.5,0);
\draw (recv.bout) to[short, -o] ++(0.5,0);
\draw (D1) to[C=$C_{TX}$] (D1|-VSSQ) node[tlground]{};
\draw (D3) to[C,a=$C_{RX}$] (D3|-VSSQ) node[tlground]{};
\end{circuitikz}}
\caption{Point-to-Point Connection with Parasitic Capacitances}
\label{fig:load_caps}
\end{figure}
%
We use SPICE to analyze the power dissipation of this circuit for different operating frequencies of the input signal.
The components are dimensioned as $R_{ON} = \SI{48}{\ohm}$, $R_{TT} = \SI{60}{\ohm}$, $C_{TX} = C_{RX} = \SI{1}{\pico\farad}$ and $V_{DDQ} = \SI{1.1}{\volt}$, which is on the order of a real DDR5 interface.
For now, the transmission line is also modeled only as a parasitic capacitance with $C_{TL} = \SI{2}{\pico\farad}$.
%
%\begin{figure}
% \centering
% \begin{tikzpicture}
% \begin{axis}[
% xlabel={Operating Frequency [MHz]},
% ylabel={Power Dissipation [mW]},
% xmode=log,
% xmin=50,
% xmax=6400,
% xtick={50,100,200,400,800,1600,3200,6400},
% xticklabels={50,100,200,400,800,1600,3200,6400}
% ]
% \addplot coordinates {(50,5.02) (100,5.05) (200,5.11) (400,5.22) (800,5.42) (1600,5.8) (3200,6.46) (6400,7.09)};
% \end{axis}
% \end{tikzpicture}
% \caption{Caption}
% \label{fig:enter-label}
%\end{figure}
%
At a frequency of \SI{100}{\mega\hertz}, the dissipated power is \SI{5.7}{\milli\watt}, which is close to the termination power of the circuit of \SI{6.1}{\milli\watt}.
However, with increasing frequency, the power also increases because the capacitors start to conduct.
At \SI{1600}{\mega\hertz} (i.e., 3.2\,Gbps/pin with DDR signaling), the dissipated power is already \SI{8.6}{\milli\watt}, i.e., \SI{40}{\percent} higher than the pure termination power.
To calculate the power dissipation analytically, the clock signal with frequency $f$ and voltage swing $V_{DDQ}$ can be expressed as the Fourier series
\begin{equation}
	v(t) = \frac{V_{DDQ}}{2} + \Re \left\{\frac{-2j \cdot V_{DDQ}}{\pi} \sum_{k=1,3,5,\dots}^{\infty} \frac{1}{k} \exp(j 2 \pi f k t)\right\}
\end{equation}
with DC component $\frac{V_{DDQ}}{2}$.
%
The complex amplitudes $\underline{\hat{V}}_k$ of the frequency components can be directly determined from this equation as
\begin{equation}
\underline{\hat{V}}_k = \frac{-2j \cdot V_{DDQ}}{\pi} \cdot \frac{1}{k}.
\end{equation}
With the frequency-dependent complex impedances $\underline{Z}_k$ calculated as
\begin{equation}
\underline{Z}_k = R_{ON} + \frac{1}{j 2 \pi f k (C_{TX} + C_{RX}) + \frac{1}{R_{TT}}},
\end{equation}
the DC resistance $R_{DC}$ calculated as
\begin{equation}
R_{DC} = R_{ON} + R_{TT},
\end{equation}
and the voltage across $R_{DC}$ calculated as
\begin{equation}
V_{DC} = V_{DDQ} - \frac{V_{DDQ}}{2} = \frac{V_{DDQ}}{2},
\end{equation}
the total power dissipation $P_{total}$ can be calculated as
\begin{equation}\label{eq:fourier}
P_{total} = \frac{V_{DC}^2}{R_{DC}} + \sum_{k=1,3,5,\dots}^{\infty} \frac{|\underline{\hat{V}}_k|^2}{2} \cdot \Re \left\{\frac{1}{\underline{Z}_k}\right\}.
\end{equation}
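As a plausibility check of Equation~\ref{eq:fourier}, consider the low-frequency limit $f \rightarrow 0$, in which the capacitive term in $\underline{Z}_k$ vanishes and $\Re\{1/\underline{Z}_k\} \rightarrow 1/(R_{ON} + R_{TT})$.
Using the known identity $\sum_{k=1,3,5,\dots}^{\infty} 1/k^2 = \pi^2/8$, the sum evaluates to
\begin{equation}
	\sum_{k=1,3,5,\dots}^{\infty} \frac{1}{2} \left(\frac{2 V_{DDQ}}{\pi k}\right)^2 \frac{1}{R_{ON} + R_{TT}} = \frac{2 V_{DDQ}^2}{\pi^2 (R_{ON} + R_{TT})} \cdot \frac{\pi^2}{8} = \frac{V_{DDQ}^2}{4 (R_{ON} + R_{TT})},
\end{equation}
so that
\begin{equation}
	\lim_{f \rightarrow 0} P_{total} = \frac{V_{DDQ}^2}{4 (R_{ON} + R_{TT})} + \frac{V_{DDQ}^2}{4 (R_{ON} + R_{TT})} = \frac{V_{DDQ}^2}{2 (R_{ON} + R_{TT})}.
\end{equation}
This is the average PODL termination power of a signal with $n_0 = n_1$, i.e., the expression reduces to the pure termination power when no dynamic effects are present.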
%
In practice, the series must be truncated at a finite $k$, which can be chosen to match the finite slew rate of the signal.
For LVSTL the same equations can be applied, while for SSTL the calculation of the DC component needs to be adapted.
The dynamic power $P_{dyn}$, which adds to the termination power due to the toggling between both logic levels, is finally calculated as
\begin{equation}
P_{dyn} = P_{total} - P_{term}.
\end{equation}
One alternative formula, which is often used to calculate the dynamic power $P_{dyn}$, is given by
\begin{equation}\label{eq:approx}
P_{dyn} = \left(\sum_i C_i V_{sw,i}\right) \frac{V_{DDQ} \cdot f}{2}
\end{equation}
where $C_i$ are the capacitances along the channel and $V_{sw,i}$ the respective voltage swings at each capacitance~\cite{bak_90,joukah_12,joukah_15}.
The voltage swings are usually determined using a DC analysis for both logic levels.
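For the PODL circuit of Figure~\ref{fig:terminations}, such a DC analysis can be sketched as follows, under the simplifying assumption that all capacitances sit at the node between $R_{ON}$ and $R_{TT}$ (i.e., an ideal, lossless connection).
When driving a logic "1", the node is at $V_{DDQ}$, and when driving a logic "0", the voltage divider pulls it down to $V_{DDQ} \cdot R_{ON} / (R_{ON} + R_{TT})$, so that
\begin{equation}
	V_{sw} = V_{DDQ} - V_{DDQ} \cdot \frac{R_{ON}}{R_{ON} + R_{TT}} = V_{DDQ} \cdot \frac{R_{TT}}{R_{ON} + R_{TT}}.
\end{equation}
With the component values used above ($R_{ON} = \SI{48}{\ohm}$, $R_{TT} = \SI{60}{\ohm}$ and $V_{DDQ} = \SI{1.1}{\volt}$), this gives $V_{sw} \approx \SI{0.61}{\volt}$, i.e., roughly \SI{56}{\percent} of $V_{DDQ}$.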
While this approximation provides accurate results at low operating frequencies, current generation DRAM interfaces do not reach full swing anymore due to the large parasitic capacitances in combination with high operating frequencies.
Figure~\ref{fig:power_comp} shows the total power dissipation at different operating frequencies calculated with SPICE, Equation~\ref{eq:fourier} and Equation~\ref{eq:approx}.
%
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
xlabel={Operating Frequency [MHz]},
ylabel={Power Dissipation [mW]},
xmode=log,
xmin=50,
xmax=12800,
xtick={100,200,400,800,1600,3200,6400},
xticklabels={100,200,400,800,1600,3200,6400},
ybar,
bar width=2mm,
legend pos=north west
]
\addplot+ coordinates {(100,6.2) (200,5.1) (400,5.2) (800,5.4) (1600,5.8) (3200,6.47) (6400,7.09)};
\addplot+ coordinates {(100,6.2) (200,5.1) (400,5.2) (800,5.4) (1600,5.8) (3200,6.47) (6400,7.09)};
\addplot+ coordinates {(100,6.2) (200,5.1) (400,5.2) (800,5.4) (1600,5.8) (3200,6.6) (6400,8.2)};
\legend{SPICE, Fourier Series (This Work), Approximation (CACTI-IO)}
\end{axis}
\end{tikzpicture}
\caption{Comparison of Different Calculation Methods for Power Dissipation}
\label{fig:power_comp}
\end{figure}
%
While Equation~\ref{eq:fourier} always provides the same results as SPICE, Equation~\ref{eq:approx} is accurate at low frequencies, but overestimates the power dissipation at higher frequencies, e.g., by \SI{16}{\percent} at \SI{6400}{\mega\hertz}.
If the capacitances $C_{TX}$ and $C_{RX}$ are increased from \SI{1}{\pico\farad} to \SI{2}{\pico\farad}, the error at \SI{6400}{\mega\hertz} is as high as \SI{54}{\percent}.
The impact of the transmission line can be handled in different ways.
In \cite{holsta_19}, the authors analyze various physical DRAM interfaces, i.e., multi-DIMM, package on package, PCB trace and silicon interposer.
They show that the channels have very distinct insertion loss characteristics, which need to be taken into consideration for an accurate power estimation.
A linear loss characteristic can be approximated with matching capacitances, while more complex loss characteristics can be approximated with frequency-dependent resistance values in the Fourier-series-based calculation.
So far, the formulas for dynamic power consumption assume a switching activity of $\alpha = 1$, i.e., the signals transition from logic zero to logic one once every period.
While this is true for clock and data strobe signals, the command/address bus and data bus usually experience lower switching activities.
In particular, when the bus is operated only at SDR, the switching activity is limited to $\alpha_{max} = 0.5$.
The problem is that the switching activity $\alpha$ and number of transmitted zeros $n_0$ and ones $n_1$ alone do not determine the complete signal behavior, which is demonstrated in Figure~\ref{fig:switching_signals}.
%
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\input{img/switching_signals}
}
\caption{Two Different Signals with Identical $\alpha$, $n_0$ and $n_1$}
\label{fig:switching_signals}
\end{figure}
%
Both S1 and S2 have a switching activity of $\alpha = 0.5$ and the number of transmitted zeros and ones is $n_0 = n_1 = 8$.
However, S1 operates at half the clock frequency for the whole time, while S2 operates at the full clock frequency at the beginning and at only one fifth of the clock frequency at the end.
When the dissipated power is calculated section by section using Equation~\ref{eq:fourier}, the results for S1 and S2 differ because different voltage swings are reached in each section.
In the corner cases, a signal with switching activity $\alpha$ can be either modeled with a constant switching activity or with a maximum switching activity $\alpha_{max}$ for one part of the time and a switching activity of 0 for the other part of the time.
The actual dynamic power consumption lies between these two corner cases and is approximated in the rest of the paper by the mean value
\begin{equation}
\overline{P}_{dyn}(\alpha) = \frac{P_{dyn}(f=\alpha \cdot f_{max}) + \frac{\alpha}{\alpha_{max}} \cdot P_{dyn}(f = \alpha_{max} \cdot f_{max})}{2}.
\end{equation}
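As a short illustration with assumed values, consider a signal on an SDR bus ($\alpha_{max} = 0.5$) with a switching activity of $\alpha = 0.25$.
The first corner case treats the signal as toggling uniformly at $0.25 \cdot f_{max}$, while the second treats it as toggling at $0.5 \cdot f_{max}$ for $\alpha / \alpha_{max} = \SI{50}{\percent}$ of the time and as idle otherwise, so that
\begin{equation}
	\overline{P}_{dyn}(0.25) = \frac{P_{dyn}(f = 0.25 \cdot f_{max}) + 0.5 \cdot P_{dyn}(f = 0.5 \cdot f_{max})}{2}.
\end{equation}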
%
Finally, the switching activity $\alpha$ can be determined by counting the number of zero to one transitions $n_{0 \rightarrow 1}$ in a given time interval $\tau$ as
\begin{equation}
\alpha = \frac{n_{0 \rightarrow 1}}{\tau \cdot f_{max}}.
\end{equation}
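For example, with the hypothetical values $f_{max} = \SI{1600}{\mega\hertz}$ and $\tau = \SI{10}{\nano\second}$, the observation window spans $\tau \cdot f_{max} = 16$ clock periods.
If $n_{0 \rightarrow 1} = 4$ zero to one transitions are counted in this window, the switching activity is
\begin{equation}
	\alpha = \frac{4}{\SI{10}{\nano\second} \cdot \SI{1600}{\mega\hertz}} = \frac{4}{16} = 0.25.
\end{equation}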
\todo{data dependent interface power calc, we count n0, n1 and toggles for CMD/ADDR and data bus}
%
%
%the assumption was that full toggle rate/ switching activity, only the case for clock signal
%-> calculate energy for one toggle, count toggles, figure for full toggle no toggle vs half toggle whole time
%
%While these equations work for a clock signal
%
%only depends on resistances independent of the frequency,
%
%
%
%To calculate the dynamic power, all parasitic capacitances and transmission lines along the DRAM channel need to be considered.
%The simplest case is a point-to-point connection between the memory controller and a single DRAM device.
%Figure~\ref{fig:load_caps} shows the equivalent circuit diagram for this case, assuming low voltage swing terminated logic.
%On the driver side, all parasitic capacitances (on-chip pad and IC package) are combined into the driver load capacitance $C_{TX}$, while on the receiver side, all parasitic capacitances are combined into the receiver load capacitance $C_{RX}$.
%
%
%
%
%Load capacitances are charged and discharged, power is dissipated in driver, dynamic power, only when switching happens.
%Simple case without multiple ranks: Effective line capacitance of channel $C_{line}$ and pad + package both on driver and receiver side ($C_{TX}$ and $C_{RX}$)
%
%
%
%
%
%CACTI-IO: Several capacitances:
%$C_{int}$: Internal IO loading (loading within the IO, due to predriver nets), full swing
%$C_{tx}$: IO TX self-load including package (loading at the CPU TX pin), lower swing
%$C_{data}$: Device loading per memory data pin (DRAM device load for DQ per die)
%$C_{addr}$: Device loading per memory address pin (DRAM device load for CA per die)
%!!!different capacitances for different package types!!!
%If the signal rise time $t_r$ is less than or comparable to the transmission line flight time $t_f$, transmission line behavior becomes significant. Rule of thumb: $t_r < 2.5 t_f$
%Up to 1GHz transmission lines can be considered lossless
%Transmission line with impedance $Z_0$ can be expressed by effective line capacitance $C_{line}$, which depends on flight time:
%\begin{equation}
% C_{line} = \frac{t_f}{Z_0}
%\end{equation}
%If half bit period $\frac{t_b}{2}$ is less than flight time $t_f$, $C_{line}$ depends on $t_b$:
%\begin{equation}
% C_{line} = \frac{t_b}{Z_0} = \frac{1}{2f Z_0}
%\end{equation}
%
%intrinsic capacitance of driver with full swing VDDQ, capacitance of pad,
%Interconnect Power:
%Interconnect acts as lossless? transmission line, has characteristic impedance, dynamic power dissipated in driver when switching happens
%Termination Power:
%Far-end termination, static power depending on signal value and termination type, power is dissipated in driver (Ron), target termination resistor (and non-target termination resistor)
%\begin{equation}
% P_{dyn} = N_{pins} D_c \alpha \left(\sum_i C_i V_{sw,i}\right) V_{DDQ} f
%\end{equation}
%Voltage swing is usually less than VDD, duty cycle should be 1 because charging process takes the complete period at high frequencies -> driven signal looks like rectangle
%
%\todo{Lossless transmission line up to 1 GHz}
%\todo{Gewichtung von average Swing mit alpha}
%
%This power is associated with the transfer of commands from the memory controller to the DRAM devices and transferring data in and out of the DRAM.
%Components:
%I/O buffers: These circuits drive the data onto the external data bus when the DRAM reads or receives data from the bus during a write.
%Data and command/address bus activity: Every time data is transferred to or from the DRAM, the I/O circuitry consumes power. This includes driving the clock signals, data lines (DQ), address lines, and control signals.
%Key Characteristics:
%Dependent on the activity of the external data bus, i.e., how frequently data is transferred to and from the DRAM.
%Scales with data rate: Higher data rates (e.g., DDR4, DDR5) increase interface power due to more frequent toggling of the I/O signals.
%Often includes power consumed by termination resistances, which are used to improve signal integrity on the high-speed bus.
%While interface power is generally lower than core power, it can become significant at high memory speeds, especially in modern DRAM technologies like DDR4 and DDR5.
%%
%In contrast to core power, interface characteristics are mainly specified in standard (caps, resistors, termination, driving strength, voltage etc.)
%Depending on interface topology, power can vary greatly. Although standard also defines currents that are drawn over interface supply voltage (typically VDDQ), these currents only describe one specific test setup.
%Data that is transmitted has fixed patterns.
%In addition, power consumption of memory controller PHY is not considered, also contributes to power consumption of DRAM subsystem.
%Thus, we calculate interface power based on physical equations.
%%
%DRAM interface connections: differential clock signal ->, command/address bus ->, data bus <->, data strobe <->
%Depending on standard, there are more/different signals
%%
%While the core power is fixed for a specific device and characterized in the datasheet with different operating currents, the interface power not only depends on the device itself, but also on the physical interface (PHY) of the memory controller, the interconnect channel (PCB, TSV etc.) and other chips connected to the same channel (multi-rank configurations).
%Thus, the currents specified in the datasheet for the I/O driver supply voltage VDDQ cannot be used for accurate estimations.
%Instead, the I/O power calculation is based on physical equations to model transmission lines as shown in \cite{joukah_15,joukah_12}.
%Interface power depends on package type (TSV, POP, device soldered on PCB, DIMM: UDIMM, RDIMM, LRDIMM...), ranks etc.
%
%
%
\section{Simulator Architecture}
%
DRAMPower is not intended as a standalone simulator but is coupled to a DRAM simulator, e.g., DRAMSys.
\todo{ranks}
\todo{count 1, 0 and 0->1 based on issued commands and data, alternatively use average values}
\todo{count commands and clock cycles in each state for background power}
The simulation kernel of DRAMPower uses a timestep-based scheme to create a cycle-accurate depiction of memory accesses. Different DRAM standards are modeled as different classes in the source code in order to capture the differences in DRAM behavior more accurately.
The kernel takes a memory specification (MemSpec) file and a command list as input. A MemSpec is a machine-readable representation of a DRAM datasheet, formatted as JSON. The command list contains the DRAM commands issued by the memory controller together with their timestamps, which are processed during the simulation. The command list can be created manually, supplied as an input file, or provided directly by external tools, for example as simulation traces from DRAMSys.
The simulation starts at timestamp $t = 0$ and iteratively processes each command from the command list. Certain commands trigger follow-up commands at a later cycle relative to their own execution. \todo{needs an example} These deferred commands are referred to as implicit commands in DRAMPower and are inserted into a command queue with a given timestamp. During every simulation step, the kernel checks whether the command queue contains pending implicit commands and executes them according to their timestamp.
DRAM standards are modeled as classes in DRAMPower. Since only a handful of behaviors are shared between DRAM standards, each standard warrants its own implementation. Every implemented DRAM standard inherits from a common base class, which handles all interaction with the kernel. The kernel dispatches commands to the instantiated DRAM class, which then routes them through its own function table, in which commands are associated with the functions implemented in the DRAM class.
DRAMPower is also able to calculate the interface power consumption. This is achieved by simulating a bit-accurate depiction of the command and data buses of a DRAM device. Each command of a given DRAM standard has a specified bit pattern, which is used by the controller to distinguish between commands. During execution, the bits on the command bus constantly change, since the command bus is overwritten with every new incoming command. This means that the bits on the command bus can flip between cycles, which increases the power consumption. The same effect applies to the data bus, which is used to handle read and write commands.
%
\subsection{Simulation Kernel}
%
Windowing allows the power to be evaluated while the simulation is running, which makes it possible to report power over time.
Implicit commands require special handling.
For example, a power-down entry is not necessarily performed when the command is issued but may be delayed.
Likewise, for RDA/WRA the auto-precharge is performed only after the RD/WR has completed internally, or only after $t_{RAS}$ has expired.
When such a command is issued, an implicit command (a lambda) is inserted into a deque of implicit commands that is ordered by timestamp.
Before a new command is executed or the window statistics are requested, the deque is checked for outstanding implicit commands with a timestamp smaller than or equal to the current time.
%
DRAMPower does not use an event-driven simulation kernel. Instead, it is triggered externally only when new commands are issued or when the total energy up to a certain point in time or up to the current time is requested.
However, a command that is issued at time $t$ may trigger an internal operation only at time $t+x$.
Thus, DRAMPower internally uses a queue of pairs, each consisting of a timestamp and a lambda expression.
When a command is issued that triggers an action in the future, a lambda expression with the respective timestamp is inserted into the queue.
Whenever a new command is issued or the total energy is requested, DRAMPower first checks whether the queue contains entries with a timestamp less than or equal to the current timestamp.
These lambdas are evaluated first.
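As an illustration of this mechanism, consider a read with auto-precharge (RDA). The following is a sketch under stated assumptions, not necessarily the exact rule implemented in DRAMPower: $t_{\mathrm{ACT}}$ denotes the time of the preceding activate to the same bank, $t_{\mathrm{RDA}}$ the time at which the RDA is issued, and $\Delta_{\mathrm{RD}}$ is a hypothetical symbol for the delay after which the read has completed internally. The implicit precharge would then be queued with the timestamp
\begin{equation}
t_{\mathrm{PRE}} = \max\left(t_{\mathrm{RDA}} + \Delta_{\mathrm{RD}},\; t_{\mathrm{ACT}} + t_{RAS}\right),
\end{equation}
and the corresponding lambda is evaluated as soon as a command or an energy request with a timestamp $t \geq t_{\mathrm{PRE}}$ arrives.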
%
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\input{img/implicit_commands}
}
\caption{Example of an Implicit Command}
\label{fig:implicit_commands}
\end{figure}
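The externally triggered evaluation also allows power to be reported per time window. As a sketch, with $E(t)$ as an assumed notation for the total energy reported at time $t$, the average power in a window $[t_1, t_2]$ with $t_2 > t_1$ is
\begin{equation}
\bar{P}(t_1, t_2) = \frac{E(t_2) - E(t_1)}{t_2 - t_1}.
\end{equation}
Evaluating this expression for consecutive windows yields the power over time.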
%
\subsection{Interface Power Calculation}
%%
The interface power is calculated with the physical equations from section ....
The power depends on the command, address, and data because the number of transmitted zeros, ones, and toggles changes.
The termination power depends on the number of transmitted zeros and ones, which is calculated efficiently using the population count (POPCNT) instruction.
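The counting can be sketched as follows, assuming a bus of $N$ lines whose previous and current values are the $N$-bit vectors $p$ and $q$. The symbols $N$, $n_0$, $n_1$ and $n_{0 \to 1}$ are introduced here for illustration only, and the negation is taken over the $N$ bus bits:
\begin{align}
n_1 &= \mathrm{popcount}(q), \\
n_0 &= N - n_1, \\
n_{0 \to 1} &= \mathrm{popcount}(\lnot p \land q).
\end{align}
For example, with $N = 4$, $p = 1010_2$ and $q = 0110_2$, this gives $n_1 = 2$, $n_0 = 2$ and, since $\lnot p \land q = 0100_2$, $n_{0 \to 1} = 1$.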
%
\subsection{Simulation Speed}
%
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\input{img/benchmark_plot}
}
\caption{DRAMSys Benchmarks}
\label{fig:benchmark_plot}
\end{figure}
DRAMPower is not used standalone but is simulated together with DRAMSys. DRAMSys is already fast \todo{cite DRAMSys4.0 paper}; we have benchmarked DRAMPower coupled to DRAMSys and found the overhead of DRAMPower to be negligible.
If DRAMPower is additionally coupled to a core simulator (e.g., gem5), the relative overhead is even smaller.
The benchmarks in Figure~\ref{fig:benchmark_plot} show the overhead of DRAMPower for a simulation with 1,000,000 requests. The benchmarks suffixed ``nostore'' are simulated without data. For these, DRAMPower uses a toggling rate to calculate the data bus energy.
\todo{DRAMPower popcnt. Comparison vector<bool> to std::bitset?}
\todo{Marco: Maybe you can give a few numbers on simulation speed here, first regarding POPCNT and perhaps second in comparison to DRAMSys, so that one can see that the simulation time of DRAMPower is actually not significant.}
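The dynamic power depends on the number of 0-to-1 toggles, which is counted as the number of set bits in $\lnot p \land q$ for the previous bus value $p$ and the current bus value $q$.
Alternatively, duty cycles or toggling rates can be used.
(DRAMPower can be divided into two aspects: static and dynamic.)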
%% static: how the various standards are implemented
%%% standard -> instruction set
%%% mapping of bit code to instruction ( 011010101 -> REF )
%%% mapping of instruction to function ( REF -> DDR5::handle_ref() )
%%% formulas for current calculation
%% dynamic: execution of the simulation
%%% list of instructions -> timestamp-based simulation
%%% implicit commands
%%% collecting counters, calculation, output of power consumption
% Interface
% PARC
%\subsection{Modeling New Refresh Commands}
%%
%banks in refresh are considered active during refresh, device is in active mode (I\_rho + ...)
%all-bank refresh: IDD5B - IDD3N
%%
%\subsection{Core Power}
%%
%new refresh commands without specified burst refresh current, only average refresh current
%%
%\input{content/05_exp_results}
\subsection{Simulation Accuracy}
%
For the interface, the estimates are to be compared with SPICE simulations, possibly using a random pattern in SPICE with fixed $n_0$, $n_1$ and $\alpha$.
For the core, we do not yet have a measurement platform for DDR5/LPDDR5/HBM3... on which specific command patterns can be issued to the DRAM and compared with the results provided by DRAMPower.
\todo{Marco, Derek}
% Compare IDD patterns with Daimler measurement
To verify the power estimates of the new DRAMPower implementation, several measurements are performed on DRAMs from three different vendors using a real LPDDR4 memory measurement platform~\cite{feldmann_23}.
Each DRAM is operated with six different access patterns, which are analogous to the following $I_{DD}$ currents:
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 1}}~$I_{DD}0$*,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 2}}~$I_{DD}4R$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 3}}~$I_{DD}4W$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 4}}~$I_{DD}5AB$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 5}}~$I_{DD}2N$ and
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 6}}~$I_{DD}6$.
As it was not possible to reproduce the usual $I_{DD}0$ pattern of ACT-PRE on the measurement platform, $I_{DD}0$* is a variation using the pattern ACT-RD-PRE, which is also replicated in the DRAMPower simulation.
The initial simulations are based on the current values specified in the datasheet of the respective vendor.
In a second simulation, the datasheet values are replaced by the current values obtained from the actual measurements.
The results are shown in Figure~\ref{fig:power_plot}.
% \begin{figure}
% \centering
% \resizebox{\linewidth}{!}{%
% \input{img/power_plot}
% }
% \caption{Average Power Consumption of Simulations and Measurements for Different Vendors}
% \label{fig:power_plot}
% \end{figure}
As can be seen, the $I_{DD}$ currents in the datasheet are overly pessimistic for all vendors:
The simulations based on the datasheets show on average a $4.8\times$ higher power consumption than the actual power measurements.
However, when the measured currents are applied to the simulation, there is still a small discrepancy:
This can be explained by the fact that the measurement platform only measures the core power and not the interface power.
As DRAMPower also includes interface power estimates, it therefore reports a higher total power.
% LP4 vs LP5
% DDR4 vs. DDR5
% Compare with DRAMPower3/4 and Vampire, possibly measurements
% Conclusion:
\section{Conclusion and Future Work}
New standards use PAM4 (GDDR6X) or PAM3 (GDDR7) signaling instead of NRZ, which requires a more complex interface power calculation.
%% more standards in DRAMPower
%% DRAMPower can do this and that
%\section*{Acknowledgements}
%DI-DERAMSys
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Footer
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% The next two lines define the bibliography style to be used, and
%% the bibliography file.
\bibliographystyle{ACM-Reference-Format}
\bibliography{drampower}
%\input{drampower-appendix}
\end{document}
\endinput