Update on Overleaf.

This commit is contained in:
Matthias Jung
2024-11-14 13:25:05 +00:00
committed by node
parent 228485184a
commit fac73b6272
5 changed files with 280 additions and 91 deletions

data/benchmarks.csv Normal file

@@ -0,0 +1,5 @@
Name,CPU_Time_ms,Real_Time_ms,Iterations
nopower-nostore,559.0857695,561.6144617524697,4
power-nostore,774.2795032500001,777.8368692524964,4
nopower-store,895.49864875,901.2260472518392,4
power-store,1004.3982304999998,1009.4151812518248,4


@@ -239,7 +239,7 @@ Google recently demonstrated that for large machine learning models, more than 9
In augmented reality devices for the Metaverse, memory can account for up to 80\,\% of power consumption.\todo{sources}
Therefore, an accurate estimation of DRAM power consumption is critical in the early stages of design in order to properly dimension the power supply circuits and cooling.
In mobile devices, on the other hand, the overall power budget is constrained to only a few watts.
Nevertheless, it is equally important to accurately estimate DRAM power consumption, for example to explore the power saving potential of new DRAM standards and their additional features to extend battery life.\cite{borgho_18}
In the current state of the art, there are two widely used open-source simulation tools for estimating DRAM power consumption, namely \textit{DRAMPower}~\cite{kargoo_14} and \textit{CACTI-IO}~\cite{joukah_12,joukah_15}.
DRAMPower focuses on the power consumption of the DRAM core, while CACTI-IO models the power consumption of the DRAM interface.
Unfortunately, neither tool has been updated in recent years, so they only provide support for older DRAM standards.
@@ -261,44 +261,35 @@ The rest of the paper is structured as follows \todo{...}
%\input{content/02_related_works}
\section{Related Work}
In this section we provide an overview of the related work.
A well-known and often used DRAM power model is the System Power Calculator by Micron~\cite{micron_ddr3_11_kopie_ipsj}.
It is provided in the form of spreadsheets for various JEDEC standards including DDR/2/3/4 and LPDDR2/3/4/4X.
The power estimation is based on data sheet currents and timings for a specific DRAM device and on workload specifications like the read-write ratio or the time the DRAM spends in each state.
However, this modeling approach can only achieve a limited accuracy because the actual command trace that is issued to the DRAM by the memory controller is not considered.
In addition, no spreadsheets exist for current-generation standards.
%However, this model is not accurate enough, as it assumes only certain workload characteristics and it is not looking on the actual executed application. There are further limitations in that model: Micron uses the minimal timing constrains from the datasheet specifications instead of the actual timings.
%But in practice there are dependencies between consecutive memory accesses so that the controller may accelerate or postpone commands. Furthermore, Micron assumes that the controller uses a close-page policy (precharge after each memory access) and that there is only one bank open at the same time. Due to this, a large lack of flexibility and accuracy exists in this model.
%
\todo{CACTI-IO~\cite{joukah_12,joukah_15}, CACTI 7~\cite{balkah_17}, NVSim~\cite{donxu_12}}
%% dramsys?
%% anything else?
% DRAMPower3/4
%% Vampire
% VAMPIRE is all well and good if measurement results for a specific device are available. In most cases, however, they are not.
% It is questionable whether the data dependency and structural dependency of the core power is really as large as reported in the paper.
% What VAMPIRE does not consider are different interface configurations -> PCB, target and NT termination, etc.
% Micron Excel-Sheet
% CACTI-IO
A more accurate simulation tool is DRAMPower~\cite{kargoo_14}, which also relies on data sheet values but additionally uses a real DRAM command trace as input to model the internal state transitions with cycle accuracy.
However, the internal DRAM states are simplified and the power dependence on the number of active DRAM banks is not considered.
Thus, DRAMPower was enhanced with a bank-sensitive model in~\cite{junmat_16b,matzul_17} to improve its accuracy.
Still, the tool has two drawbacks: it only models core power, but no interface power, and it has not been updated to the latest standards yet.
Another simulation tool similar to DRAMPower is VAMPIRE~\cite{ghoyag_18}.
This tool puts its focus on the power variations between different DRAM modules, within one DRAM module depending on the access location, and the data value dependency.
VAMPIRE is calibrated with measurements of real DRAM modules and provides very accurate results.
However, this presupposes that real measurements are available for the devices to be used, which is usually not the case in the early stages of design.
In addition, VAMPIRE only supports DDR3.
\todo{analytical core power model Vogelsang, highly proprietary IP}
When it comes to DRAM interface power modeling, the most popular software is CACTI-IO~\cite{joukah_12,joukah_15}.
CACTI-IO does not rely on data sheet currents; instead, it uses an equivalent circuit diagram of the DRAM subsystem's actual interface architecture, since this architecture is not fixed for a specific device.
The power consumption is then calculated using a simplified network analysis.
While this approach leads to accurate results for older generation DRAM standards, the error introduced by the simplifications is significantly higher for current generation standards as they support much higher data rates.
In summary, there is no publicly available DRAM power simulation tool capable of modeling both core and interface power of current generation DRAM standards with high accuracy.
%\input{content/03_overview}
\section{DRAM Background}
%
This section provides the necessary background on the DRAM core and interface that is relevant for power modeling.
It also briefly introduces the different families of DRAM standards and explains their main differences.
%
\subsection{Core}
%
@@ -306,10 +297,11 @@ DRAM is a type of memory mainly optimized for a low cost per bit.
To achieve a high storage density, the chips are internally organized in a hierarchical fashion consisting of columns, rows, banks and, for newer standards, bank groups.
When data should be read or written from or to a column, the corresponding row must be activated first.
Within each bank, only one row can be active at a time and the bank must be precharged before a new row can be activated.
Data is transferred over the interface in a burst fashion, i.e., for a read operation, a large amount of data is fetched internally in parallel from the array to the interface and then transferred to the memory controller in multiple beats (usually 8 or 16).
Information is stored as an electrical charge held in a tiny capacitor.
As the capacitor leaks this charge over time, each DRAM cell must be refreshed regularly (usually every 32 to 64\,ms).
The refresh operation is triggered externally by the memory controller with a refresh command.
During refresh, no data can be accessed within the target bank(s).
Thus, only a few rows are refreshed each time to avoid long access delays and a refresh command is sent every few microseconds.
To save energy, DRAM devices can be put into a power down mode when no data accesses are performed.
This disables parts of the core and interface.
@@ -335,7 +327,7 @@ One connection type widely used in PCs and servers is the dual inline memory mod
Multiple DRAM chips are soldered onto a small PCB with pins on the bottom edge, which is then plugged into a socket on the main PCB.
DIMMs require special considerations for power modeling as there are different wiring topologies, off-die termination schemes and in some cases additional buffer chips for the command/address bus and data bus.
%
\subsection{DRAM Standards}
%
\todo{Special features see Luizas Master Thesis, e.g., DBI, write X, new refresh modes etc.}
Over the last quarter century, JEDEC has published more than 20 different DRAM standards.
@@ -390,11 +382,11 @@ DDR5 UDIMM: Fly-By , other DIMMs: LRDIMM, RDIMM, SODIMM, CUDIMM (clocked unbuffe
%
\section{Core Power Modeling}
%
This section explains the modeling of core power, while the modeling of interface power is covered in the next section.
Core and interface can be considered completely independent of each other because they always use different supply voltages.
Core power refers to the power consumed by the internal circuitry of the DRAM device, i.e., the memory arrays, sense amplifiers, row and column decoders, I/O gating and control logic.
The receiver circuits at the interface are also operated with the core supply voltage and are therefore included in the core power.
As the internal architecture of modern DRAM devices is very complex and highly proprietary, core power calculation cannot be based on network analysis.
However, each DRAM standard defines a set of currents for fixed operating scenarios, which are listed in vendor datasheets.
Based on these currents, the core power can be estimated.
%%%%
@@ -424,7 +416,7 @@ The minimum set specified in all DRAM standards includes the following nine curr
\end{itemize}
%
Unfortunately, the different JEDEC subcommittees, which are responsible for formulating DRAM standards, are very inconsistent in specifying the currents.
Apart from different naming schemes\footnote{To avoid confusion, we use our own naming scheme, which is a mixture of several standards.}, the measurement conditions mentioned above only apply for standards of the DDR family, while they differ for LPDDR, GDDR and HBM.
For example, LPDDR measures $I_{DD3N}$, $I_{DD3P}$, $I_{DD4R}$ and $I_{DD4W}$ with only one bank active.
GDDR measures $I_{DD3N}$ and $I_{DD3P}$ with one bank active, while $I_{DD4R}$ and $I_{DD4W}$ are measured with one bank in each bank group active.
HBM, in turn, measures $I_{DD3N}$ and $I_{DD3P}$ with one bank active and $I_{DD4R}$ as well as $I_{DD4W}$ with all banks active.
@@ -432,10 +424,9 @@ Section~\ref{subsec:bankwise} explains how these different measurement condition
Similarly, the refresh currents are also measured under various conditions.
While DDR standards specify a burst refresh current $I_{DD5B}$ for all available refresh modes, LPDDR standards specify a burst refresh current only for all-bank refresh; for per-bank refresh, an average current $I_{DD5A}$ is provided instead.
The difference between $I_{DD5B}$ and $I_{DD5A}$ is the spacing between two consecutive refresh commands.
It is the refresh cycle time $t_{RFC}$ (i.e., the duration of a single refresh operation) for $I_{DD5B}$ and the much longer average refresh interval $t_{REFI}$ (i.e., the interval at which refresh commands need to be issued in normal operation) for $I_{DD5A}$.
GDDR5/5X/6 and HBM1/2 do not specify a current for per-bank refresh at all although they support it, while HBM3 specifies a burst refresh current both for all-bank and per-bank refresh.
Section~\ref{subsec:refresh} shows how refresh power can be modeled using the provided currents of each standard.
\todo{introduce pb refresh earlier, say that abbreviations in the paper do not always match abbreviations in standard}
\todo{last subsection? extra features, maybe future work?}
\todo{multiple supply voltages!}
@@ -459,7 +450,11 @@ $\rho$ is a vendor- and device-specific factor between 0 and 1, which can be det
Alternatively, the pessimistic assumption of $\rho = 1$ can be made, which leads to the simplified model with only two distinct states as used by Micron~\cite{micron_ddr3_11_kopie_ipsj}.
For standards of the DDR family, $I_{DD3N} = I_{\circled{B}}$, while for LPDDR, GDDR and HBM, $I_{DD3N} = I_{\circled{1}}$.
This difference must be taken into account when calculating the background power.
If the current $I_{DD2N}$, the factor $\rho$, a current $I_{DD3N}$ measured with a known number of active banks, and the total number of banks $B$ are given, all other currents can be calculated. For $N$ active banks, it holds that
\begin{equation}
I_{\circled{N}} = I_{DD2N} + (I_{\circled{B}} - I_{DD2N}) \cdot \left(\rho + (1-\rho)\cdot \frac{N}{B}\right)
\end{equation}
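As an illustration, this bank-scaling rule can be sketched in a few lines of Python (a hypothetical helper for illustration only, not part of the DRAMPower API):

```python
def background_current(i_dd2n, i_b, rho, n, b):
    """Background current with n of b banks active (n >= 1),
    following the equation above.

    i_dd2n: precharged standby current,
    i_b:    current measured with all b banks active,
    rho:    vendor- and device-specific factor between 0 and 1.
    """
    return i_dd2n + (i_b - i_dd2n) * (rho + (1 - rho) * n / b)
```

With the pessimistic assumption $\rho = 1$, the result collapses to $I_{\circled{B}}$ for any number of active banks, which corresponds to the simplified two-state model mentioned above.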
When the DRAM is in power-down mode, the dependence of the current on the number of active banks is much smaller, so we only distinguish between two states characterized by $I_{DD2P}$ and $I_{DD3P}$.
The average command power is calculated by counting the number of commands of each type, adding up the energy that is consumed for all these commands, and dividing the total energy by the simulated time.
@@ -473,7 +468,6 @@ For a write command, $I_{DD4R}$ is replaced with $I_{DD4W}$.
However, this equation only works if $I_{DD4R}$ and $I_{DD3N}$ are measured with the same number of banks active, which is not the case for GDDR and HBM.
Thus, the equations need to be adapted accordingly, i.e., for GDDR, $I_{DD3N}$ must be replaced with $I_{\circled{BG}}$ with $BG$ being the number of bank groups, while for HBM, $I_{DD3N}$ must be replaced with $I_{\circled{B}}$.
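As a worked example of this adaptation, the energy of a single read command can be sketched, under our assumption (not stated explicitly in this excerpt) that a burst of length $BL$ occupies $BL/2$ clock cycles of duration $t_{CK}$ due to the double data rate, as
\begin{equation}
E_{RD} = V_{DD} \cdot \left(I_{DD4R} - I_{DD3N}\right) \cdot \frac{BL}{2} \cdot t_{CK}
\end{equation}
where, for GDDR and HBM, $I_{DD3N}$ is replaced as described above.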
%
\todo{introduce burst length earlier}
%
%\subsection{Current Measurement Conditions}
%%
@@ -520,6 +514,21 @@ Thus, the equations need to be adapted accordingly, i.e., for GDDR, $I_{DD3N}$ m
%
\subsection{Refresh Power}\label{subsec:refresh}
%
Depending on the DRAM standard, various refresh modes are supported.
They differ in the number of banks that are refreshed with a single command.
All-bank refresh commands target all banks of the device at once.
As no data can be accessed in banks where a refresh is in progress, this mode can cause a large drop in bandwidth.
Thus, newer DRAM standards offer improved refresh modes where only a single bank (per-bank refresh), two banks (per-2-bank refresh) or one bank in each bank group (same-bank refresh) of the device are targeted with a single command, while the remaining banks can still be accessed in the meantime.
The duration of a single refresh command is the refresh cycle time $t_{RFC}$, which is also the spacing of refresh commands when measuring the burst refresh current $I_{DD5B}$.
Thus, when a burst refresh current is provided, the energy for a single refresh command $E_{REF}$ can be calculated as
\begin{equation}
E_{REF} = V_{DD} \cdot \left(I_{DD5B} - I_{\circled{N}}\right) \cdot t_{RFC}
\end{equation}
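A minimal sketch of this calculation in Python (illustrative names, assuming consistent units for voltage, current and time):

```python
def refresh_energy(v_dd, i_dd5b, i_bg, t_rfc):
    """Energy of a single burst refresh command (equation above).

    The background current i_bg of the targeted banks is subtracted
    from the burst refresh current i_dd5b, and the difference is
    integrated over the refresh cycle time t_rfc.
    """
    return v_dd * (i_dd5b - i_bg) * t_rfc
```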
During refresh, the targeted banks are considered active because the refresh operation internally activates and precharges the rows to be refreshed.
As explained in Section~\ref{subsec:current_measurement}, JEDEC
%
\begin{figure}
@@ -559,6 +568,7 @@ Same-bank refresh for device with \textit{BG} bank groups and \textit{BA} banks
%
Interface power refers to the power consumed by the drivers for the communication between memory controller and DRAM devices.
In contrast to the core power, which is fixed for a specific device, the interface power depends on the complete DRAM subsystem architecture, i.e., the physical layer (PHY) of the memory controller, the channel architecture (e.g., number of ranks), the channel characteristics (e.g., channel loss and parasitic capacitances) and the DRAM PHYs.
\todo{modeling based on currents not possible, moreover, currents measured with ODT disabled}
It can be divided into two parts:
%
\begin{itemize}
@@ -736,33 +746,58 @@ As an example, Figure~\ref{fig:terminations} shows the two equivalent circuit di
%
\begin{figure}
\centering
\begin{circuitikz}
\ctikzset{bipoles/resistor/height=0.15}
\ctikzset{bipoles/resistor/width=0.4}
\draw (0,0)
node[tground](VDDQ1){}
to [R=$R_{ON}$] ++(0,-1.5) coordinate(x1)
to [short=$"1"$, name={s1}] ++(2,0) coordinate(x2)
to [R,a=$R_{TT}$] ++(0,1.5) node[tground](VDDQ2){};
\node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
\node[anchor=south] at (VDDQ2) {$V_{DDQ}$};
\draw(x2) to [open] ++(1.5,0) coordinate(x3)
to [R=$R_{ON}$] ++(0,-1.0) node[ground](x4){};
\draw(x3)
to [short=$"0"$, name={s2}] ++(2,0)
to [R,a=$R_{TT}$] ++(0,1.5) node[tground](VDDQ3){};
\node[anchor=south] at (VDDQ3) {$V_{DDQ}$};
\path(x4) ++(0,-0.75) coordinate(x5);
\draw(s1|-x5) node[](){\bfseries (a) Driving Logic "1"};
\draw(s2|-x5) node[](){\bfseries (b) Driving Logic "0"};
\end{circuitikz}%
%\begin{subfigure}[t]{0.49\linewidth}
%\centering
%\resizebox{\linewidth}{!}{%
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0)
% node[tground](VDDQ1){}
% to [R=$R_{ON}$] ++(0,-1.5) coordinate(foo)
% to [short=$"1"$] ++(3,0)
% to [R,l=$R_{TT}$] ++(0,1.5) node[tground](VDDQ2){};
% \node[anchor=south] at (VDDQ1) {$V_{DDQ}$};
% \node[anchor=south] at (VDDQ2) {$V_{DDQ}$};
% \draw[white](foo) to [R] ++(0,-1.0) node[ground](VDDQ){};
% \end{circuitikz}}
%\caption{Driving Logic "1"}
%\label{fig:term_logic_1}
%\end{subfigure}
%%
%\begin{subfigure}[t]{0.49\linewidth}
%\centering
%\resizebox{\linewidth}{!}{%
% \begin{circuitikz}
% \ctikzset{bipoles/resistor/height=0.15}
% \ctikzset{bipoles/resistor/width=0.4}
% \draw (0,0) node[tlground]{} to[R,a=$R_{ON}$] ++(0,1.5) to[short=$"0"$] ++(3,0) to[R,l=$R_{TT}$] ++(0,1.5) node[tground](VDDQ){};
% \node[anchor=south] at (VDDQ) {$V_{DDQ}$};
% \end{circuitikz}}
%\caption{Driving Logic "0"}
%\label{fig:term_logic_0}
%\end{subfigure}
\caption{Equivalent Circuit Diagrams for PODL Termination Power}
\label{fig:terminations}
\end{figure}
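The static termination power can be read directly from these equivalent circuits: when a logic "1" is driven, both resistor ends are tied to $V_{DDQ}$, so no static current flows; when a logic "0" is driven, $V_{DDQ}$ drops across the series resistance $R_{ON} + R_{TT}$. With $p_0$ denoting the fraction of transmitted zeros (a symbol we introduce here for illustration), the average static termination power of a PODL-terminated line is thus approximately
\begin{equation}
P_{term} \approx p_0 \cdot \frac{V_{DDQ}^2}{R_{ON} + R_{TT}}
\end{equation}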
%
@@ -1076,6 +1111,7 @@ Finally, the switching activity $\alpha$ can be determined by counting the numbe
%
\section{Simulator Architecture}
%
DRAMPower is not a standalone simulator but is coupled to a memory system simulator such as DRAMSys.
\todo{ranks}
\todo{count 1, 0 and 0->1 based on issued commands and data, alternatively use average values}
\todo{count commands and clock cycles in each state for background power}
@@ -1137,7 +1173,24 @@ These lambdas are then first evaluated.
Physical equations from section ...;
the power depends on command, address and data because the number of transmitted 0s, 1s and toggles changes.
The termination power depends on the number of transmitted 0s and 1s, which can be calculated efficiently using the population count (POPCNT) instruction.
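This counting step can be sketched in Python, where the built-in bit counting stands in for the hardware POPCNT instruction (an illustrative helper, not the actual DRAMPower code):

```python
def count_ones_zeros(words, width):
    """Count transmitted 1 and 0 bits over a trace of data words.

    Each word is an unsigned integer of `width` bits;
    bin(w).count('1') plays the role of POPCNT.
    """
    ones = sum(bin(w).count("1") for w in words)
    zeros = len(words) * width - ones
    return ones, zeros
```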
%
\subsection{Simulation Speed}
%
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\input{img/benchmark_plot}
}
\caption{DRAMSys Benchmarks}
\label{fig:benchmark_plot}
\end{figure}
DRAMPower is not standalone; it is simulated together with DRAMSys. DRAMSys itself is already fast (ref paper DRAMSys4.0), and our benchmarks of DRAMPower coupled to DRAMSys show that the overhead of DRAMPower is negligible.
If a core simulator (e.g., gem5) is additionally coupled, the relative overhead becomes even smaller.
The benchmarks in Figure~\ref{fig:benchmark_plot} show the overhead of DRAMPower for a simulation with 1,000,000 requests. The benchmarks suffixed ``nostore'' are simulated without data; in this case, DRAMPower uses a toggling rate for calculating the data bus energy.
\todo{DRAMPower popcnt. Comparison vector<bool> to std::bitset?}
\todo{Marco: Maybe you can give some numbers on simulation speed here, first regarding POPCNT and perhaps also in comparison to DRAMSys, so that it becomes clear that the simulation time of DRAMPower is essentially negligible.}
The dynamic power depends on the number of 0-to-1 toggles, which is calculated bitwise as $\lnot p \land q$ for consecutive data words $p$ and $q$.
Alternatively, duty cycles and toggling rates can be used.
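The toggle counting can be sketched as follows (illustrative Python, mirroring the bitwise $\lnot p \land q$ formulation):

```python
def count_rising_toggles(words, width):
    """Count 0->1 transitions between consecutive data words.

    A bit rises exactly where (~p & q) is set, so the rising-edge
    count is the population count of that mask; `width` bounds the
    complement of p to the bus width.
    """
    mask = (1 << width) - 1
    return sum(bin(~p & q & mask).count("1")
               for p, q in zip(words, words[1:]))
```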
@@ -1169,17 +1222,36 @@ alternatively, duty cycle/toggling rates can be used
%\input{content/05_exp_results}
\subsection{Simulation Accuracy}
%
Interface -> comparison with SPICE; a random pattern with fixed $n_0$, $n_1$ and $\alpha$ could be used in SPICE.
Core -> we do not yet have a measurement platform for DDR5/LPDDR5/HBM3... on which we can issue specific command patterns to the DRAM and compare the measurements with the results provided by DRAMPower.
\todo{Marco, Derek}
% IDD Patterns mit Daimler Messung vergleichen
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\input{img/power_plot_hynix}
}
\caption{Average Power Consumption of Simulations and Measurements for Different Vendors}
\label{fig:power_plot_hynix}
\end{figure}
In order to verify the power estimates of the new DRAMPower implementation, several measurements are performed on DRAMs from three different vendors based on a real LPDDR4 memory measurement platform~\cite{feldmann_23}.
Each DRAM is operated with six different access patterns, which are analogous to the following $I_{DD}$ currents:
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 1}}~$I_{DD0}$*,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 2}}~$I_{DD4R}$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 3}}~$I_{DD4W}$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 4}}~$I_{DD5AB}$,
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 5}}~$I_{DD2N}$ and
\tikz{\node[circle,draw,inner sep=1pt] {\tiny 6}}~$I_{DD6}$.
As it was not possible to reproduce the usual $I_{DD0}$ pattern of ACT-PRE on the measurement platform, $I_{DD0}$* is a variation using the pattern ACT-RD-PRE, which is also reproduced in the DRAMPower simulation.
The initial simulations are based on the current values specified in the datasheet of the specific vendor.
Then, the currents obtained from the actual measurements are applied in a second simulation.
The results are shown in Figure~\ref{fig:power_plot_hynix}.
%\begin{figure}
% \centering
% \resizebox{\linewidth}{!}{%
% \input{img/power_plot}
% }
% \caption{Average Power Consumption of Simulations and Measurements for Different Vendors}
% \label{fig:power_plot}
%\end{figure}
As can be seen, the $I_{DD}$ currents in the datasheets are overly pessimistic for all vendors:
The simulations based on the datasheets show on average a $4.8\times$ higher power consumption than the actual power measurements.
However, when the measured currents are applied to the simulation, a small discrepancy remains:
The measurement platform only measures the core power and not the interface power.
As DRAMPower also includes interface power estimates, it therefore reports a higher total power.
% LP4 vs LP5
% DDR4 vs. DDR5

View File

@@ -142,3 +142,58 @@ year={1998}
year={1990},
publisher={Addison-Wesley Publishing Company}
}
@inproceedings{borgho_18,
author = {Boroumand, Amirali and Ghose, Saugata and Kim, Youngsok and Ausavarungnirun, Rachata and Shiu, Eric and Thakur, Rahul and Kim, Daehyun and Kuusela, Aki and Knies, Allan and Ranganathan, Parthasarathy and Mutlu, Onur},
title = {Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks},
year = {2018},
isbn = {9781450349116},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3173162.3173177},
doi = {10.1145/3173162.3173177},
booktitle = {Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {316--331},
numpages = {16},
keywords = {consumer workloads, data movement, energy efficiency, memory systems, processing-in-memory},
location = {Williamsburg, VA, USA},
series = {ASPLOS '18}
}
@inproceedings{feldmann_23,
author = {Feldmann, Johannes and Steiner, Lukas and Christ, Derek and Psota, Thomas and Jung, Matthias and Wehn, Norbert},
title = {A Precise Measurement Platform for {LPDDR4} Memories},
year = {2023},
isbn = {9798400716447},
url = {https://dl.acm.org/doi/10.1145/3631882.3631899},
doi = {10.1145/3631882.3631899},
booktitle = {Proceedings of the International Symposium on Memory Systems},
publisher = {Association for Computing Machinery},
pages = {1--8},
location = {Alexandria, VA, USA}
}

img/benchmark_plot.tex Normal file

@@ -0,0 +1,57 @@
% \begin{tikzpicture}
% \begin{axis}[
% ybar,
% symbolic x coords={nopower-nostore, power-nostore, nopower-store, power-store},
% xtick=data,
% xlabel={Benchmark},
% ylabel={CPU Time [ms]},
% nodes near coords,
% ymin=0,
% width=15cm,
% height=5cm,
% bar width=20pt,
% enlarge x limits=0.2,
% ytick=\empty,
% axis line style={-} ,
% ymin=0,
% enlarge y limits={value=0.3,upper},
% % every node near coord/.append style={yshift=-0.4cm, xshift=23pt},
% ]
% \addplot[
% color=blue,
% fill=blue!40
% ] table[
% x=Name,
% y=CPU_Time_ms,
% col sep=comma
% ] {data/benchmarks.csv};
% \end{axis}
% \end{tikzpicture}
\pgfplotsset{compat=1.3}
\begin{tikzpicture}
\begin{axis}[
xbar,
symbolic y coords={nopower-nostore, power-nostore, nopower-store, power-store},
ytick=data,
ylabel={Benchmark},
xlabel={CPU Time [ms]},
nodes near coords,
width=8cm,
height=4cm,
bar width=8pt,
enlarge y limits=0.2,
enlarge x limits=0.5,
xtick=\empty,
axis line style={-} ,
ytick style={draw=none},
]
\addplot[
color=blue,
fill=blue!40
] table[
y=Name,
x=CPU_Time_ms,
col sep=comma
] {data/benchmarks.csv};
\end{axis}
\end{tikzpicture}


@@ -92,12 +92,12 @@
% Legend
\begin{scope}
\draw[green!50,line width=0.9pt] (-0.15, -8) -- (0.15, -8);
\node[anchor=west] at (0.2,-8) {Sim. Datasheet IDDs};
\draw[red!50,line width=0.9pt] (3.4-0.15, -8) -- (3.4+0.15, -8);
\node[anchor=west] at (3.6,-8) {Sim. Measured IDDs};
\draw[blue!50,line width=0.9pt] (-0.15, -8.5) -- (0.15, -8.5);
\node[anchor=west] at (0.2,-8.5) {Measurement};
\end{scope}
\end{tikzpicture}%