%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%                                                                 %%
%% Please do not use \input{...} to include other tex files.       %%
%% Submit your LaTeX manuscript as one .tex document.              %%
%%                                                                 %%
%% All additional figures and files should be attached             %%
%% separately and not embedded in the \TeX\ document itself.       %%
%%                                                                 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% see https://www.springer.com/journal/10766/submission-guidelines#Instructions%20for%20Authors_Title%20Page for submission guidelines

%%\documentclass[referee,sn-basic]{sn-jnl}% referee option is meant for double line spacing

%%=======================================================%%
%% to print line numbers in the margin use lineno option %%
%%=======================================================%%

%%\documentclass[lineno,sn-basic]{sn-jnl}% Basic Springer Nature Reference Style/Chemistry Reference Style

%%======================================================%%
%% to compile with pdflatex/xelatex use pdflatex option %%
%%======================================================%%

%%\documentclass[pdflatex,sn-basic]{sn-jnl}% Basic Springer Nature Reference Style/Chemistry Reference Style

% necessary hack to load tikz because Springer Nature uses the "program" package which results in errors
% see https://tex.stackexchange.com/a/615043
\RequirePackage[dvipsnames]{xcolor}
\RequirePackage{tikz}

%%\documentclass[sn-basic]{sn-jnl}% Basic Springer Nature Reference Style/Chemistry Reference Style
\documentclass[sn-mathphys]{sn-jnl}% Math and Physical Sciences Reference Style
%%\documentclass[sn-aps]{sn-jnl}% American Physical Society (APS) Reference Style
%%\documentclass[sn-vancouver]{sn-jnl}% Vancouver Reference Style
%%\documentclass[sn-apa]{sn-jnl}% APA Reference Style
%%\documentclass[sn-chicago]{sn-jnl}% Chicago-based Humanities Reference Style
%%\documentclass[sn-standardnature]{sn-jnl}% Standard Nature Portfolio Reference Style
%%\documentclass[default]{sn-jnl}% Default
%%\documentclass[default,iicol]{sn-jnl}% Default with double column layout
%%%% Standard Packages
\usepackage[dvipsnames]{xcolor}
\newcommand\todo[1]{\textcolor{red}{#1}}
\newcommand\new[1]{\textcolor{blue}{#1}}

\usepackage{graphicx}
\usepackage{tabularray}
\usepackage{siunitx}
\DeclareSIUnit\transfer{T}
\sisetup{per-mode = symbol}

\usepackage{amsmath}
\usepackage{ifthen}

%\usepackage{tikz}
\usetikzlibrary{positioning}
\usetikzlibrary{backgrounds}
\usetikzlibrary{arrows.meta}
\usepackage{subcaption}

\usepackage{minted}
\definecolor{LightGray}{gray}{0.9}

\usepackage{pgfplots}
\pgfplotsset{compat=1.9}
\usepackage{circuitikz}
\usetikzlibrary{fit}
\usetikzlibrary{calc}

\usepackage{listings}% required for the \lstset below
\lstset{
    literate={~}{$\sim$}{1}
}

%\usepackage[hidelinks]{hyperref} % --> already loaded in the template
%%%%
%%%%%=============================================================================%%%%
%%%% Remarks: This template is provided to aid authors with the preparation
%%%% of original research articles intended for submission to journals published
%%%% by Springer Nature. The guidance has been prepared in partnership with
%%%% production teams to conform to Springer Nature technical requirements.
%%%% Editorial and presentation requirements differ among journal portfolios and
%%%% research disciplines. You may find sections in this template are irrelevant
%%%% to your work and are empowered to omit any such section if allowed by the
%%%% journal you intend to submit to. The submission guidelines and policies
%%%% of the journal take precedence. A detailed User Manual is available in the
%%%% template package for technical guidance.
%%%%%=============================================================================%%%%

\jyear{2022}%

\raggedbottom
%%\unnumbered% uncomment this for unnumbered level heads

\begin{document}

\title[Split'n'Cover: ISO\,26262 Hardware Safety Analysis with SystemC]{Split'n'Cover: ISO\,26262 Hardware Safety Analysis with SystemC}

%%=============================================================%%
%% Prefix     -> \pfx{Dr}
%% GivenName  -> \fnm{Joergen W.}
%% Particle   -> \spfx{van der} -> surname prefix
%% FamilyName -> \sur{Ploeg}
%% Suffix     -> \sfx{IV}
%% NatureName -> \tanm{Poet Laureate} -> Title after name
%% Degrees    -> \dgr{MSc, PhD}
%% \author*[1,2]{\pfx{Dr} \fnm{Joergen W.} \spfx{van der} \sur{Ploeg} \sfx{IV} \tanm{Poet Laureate}
%% \dgr{MSc, PhD}}\email{iauthor@gmail.com}
%%=============================================================%%

\author*[1]{\fnm{Lukas} \sur{Steiner}}\email{lukas.steiner@rptu.de}
\author[1]{\fnm{Kira} \sur{Kraft}}\email{kira.kraft@rptu.de}
\author[2]{\fnm{Derek} \sur{Christ}}\email{derek.christ@iese.fraunhofer.de}
\author[2]{\fnm{Denis} \sur{Uecker}}\email{denis.uecker@iese.fraunhofer.de}
\author[2]{\fnm{Christian} \sur{Malek}}\email{christian.malek@iese.fraunhofer.de}
\author[2,3]{\fnm{Matthias} \sur{Jung}}\email{matthias.jung@iese.fraunhofer.de}
\author[1]{\fnm{Norbert}~\sur{Wehn}}\email{norbert.wehn@rptu.de}

\affil[1]{\orgdiv{Microelectronics Systems Design Research Group}, \orgname{RPTU Kaiserslautern-Landau}, \orgaddress{\street{Erwin-Schrödinger-Straße 12}, \city{Kaiserslautern}, \postcode{67663}, \state{Rhineland-Palatinate}, \country{Germany}}}

\affil[2]{\orgdiv{Embedded Systems}, \orgname{Fraunhofer IESE}, \orgaddress{\street{Fraunhofer-Platz 1}, \city{Kaiserslautern}, \postcode{67663}, \state{Rhineland-Palatinate}, \country{Germany}}}

\affil[3]{\orgdiv{Embedded Systems}, \orgname{HTW Saar}, \orgaddress{\street{Goebenstraße 40}, \city{Saarbrücken}, \postcode{66117}, \state{Saarland}, \country{Germany}}}

\abstract{
The development of safe hardware is a major concern in automotive applications. \new{In particular, the advent of consumer hardware such as LPDDR memories in autonomous driving poses a major challenge to the automotive community.}
Parts 5 and 11 of ISO\,26262 define procedures and methods for the development of hardware to achieve a specific automotive safety integrity level. In this paper, we present a novel methodology that combines the hardware metrics analysis of ISO\,26262 with SystemC-based virtual prototyping. To show the applicability of our methodology, we modeled an \new{LPDDR5} memory system of a current state-of-the-art ADAS system and estimated the ASIL of this system. The new methodology is implemented in SystemC and provided as open source.
}

\keywords{ISO\,26262, SystemC, DRAM, LPDDR4, LPDDR5, Safety}

%%\pacs[JEL Classification]{D8, H51}

%%\pacs[MSC Classification]{35A01, 65L10, 65L12, 65L20, 65L70}

\maketitle

% PLAN:
% Replace LPDDR4 with LPDDR5
% New system diagram
% Safety analysis of LPDDR5
% Bus error calculation by Kira
% Performance simulation by Derek --> show the advantages of orthogonal simulation

\section{Introduction}
\label{sec:intro}
Functional safety is a major concern in the development of automotive applications because the lives of drivers, passengers, and other road users must be protected to the highest degree. \new{In particular, the advent of consumer hardware such as LPDDR memories in autonomous driving poses a major challenge to the automotive community}.
Therefore, the development of automotive components requires the usage of specific quality and safety standards, such as ISO\,26262~\cite{iso26262}.
The implementation of this standard is intended to ensure the functional safety of a system with electrical/electronic components in road vehicles. In particular, Parts 5 and 11 of the standard deal with the development processes at the hardware level and define procedures and methods for achieving a specific \textit{Automotive Safety Integrity Level}~(ASIL). During product development, it is therefore very important to address safety concerns from the very beginning. However, the established approaches for estimating hardware metrics are based on spreadsheets, which hardly scale for large hardware systems.

Virtual prototypes based on SystemC are high-speed, fully functional software models of physical hardware systems that can model complex hardware/software systems with reasonable simulation speed. These virtual prototypes are used in industry to reduce time to market and improve the quality of the product~\cite{deschutter_14}.

\new{This paper is an extended journal version of the previous SAMOS publication~\cite{uecjun_22}, in which we presented a novel methodology for combining the advantages of SystemC-based virtual prototypes with the safety analysis required by the ISO\,26262 standard.} Compared to previous works, our approach does not focus on the simulation of the system and on error injections; rather, we focus on the specific methodology required by ISO\,26262 regarding the evaluation of the hardware architectural metrics and how it can be implemented as an extension of the SystemC standard. Failure modeling can be seen as a type of modeling that is orthogonal to the modeling of the functionality. With our approach, however, both aspects can be integrated into the same simulation models, which provides the opportunity to analyze structure, functionality, and safety aspects simultaneously. Due to the power of the SystemC framework, we have a high level of interoperability, and functional legacy models can be enhanced by our safety amendments.
The presented method receives failure rates in \textit{Failure in Time} (FIT) and directly calculates the achievable ASIL as well as the hardware metrics as output.

\new{In this paper, we extend the work of our previous paper~\cite{uecjun_22} on LPDDR4 with the analysis of a current LPDDR5 memory system, since the recently introduced LPDDR5 memory features new additional safety measures. Furthermore, we show how strongly these safety measures affect performance by combining functional and safety simulations.}
\newpage
\noindent In summary, we make the following contributions:
\begin{itemize}
\item We present a set of basic blocks that represent the operations required by ISO\,26262.
\item We present, for the first time, a methodology, called Split'n'Cover, that uses these basic blocks to model and evaluate hardware systems regarding ASIL.
\item We provide an open-source reference implementation as a SystemC library\footnote{\url{https://github.com/myzinsky/ISO26262SystemC}}.
\item We show the application of this methodology for an example \new{LPDDR5} DRAM memory system in the automotive context.
\item \new{We provide an additional analysis of an actual LPDDR5-based automotive control unit, which combines a performance simulation with safety simulations based on our new methodology, such that trade-offs can already be analyzed on the system level.}
\end{itemize}

This paper is structured as follows: Section~\ref{sec:background} provides some background on ISO\,26262 and the required hardware metrics. Related work is discussed in Section~\ref{sec:related}. The methodology is presented in Section~\ref{sec:method}, whereas the actual implementation in SystemC is explained in Section~\ref{sec:implementation}. Section~\ref{sec:study} presents a case study for an \new{LPDDR5} DRAM memory system \new{and Section~\ref{sec:results} presents the experimental results}. Finally, Section~\ref{sec:conclusion} concludes the paper.

\section{Background}
\label{sec:background}
In this section, we present the basic requirements on safety in order to understand the hardware metric analysis of ISO\,26262.
For the safety analysis of hardware, we follow the definitions of Laprie et\,al.~\cite{avilap_04}:

\begin{description}
\item[Fault:] A defect within the system and the root cause of the violation of a safety goal, e.g., a stuck-at-0 or a single event upset due to a cosmic ray.
\item[Error:] An erroneous internal state, e.\,g., in the memory or the CPU, where the fault becomes visible.
\item[Failure:] Occurs when the error is observed and the system's behavior deviates from the specification. This might lead to the violation of a safety goal.
\end{description}

As shown in Figure~\ref{fig:bathtube}, hardware failure rates $\lambda(t)$ usually follow the so-called \textit{Bathtub Curve}. In phase \textbf{I}, we can observe early failures called \textit{Infant Mortality}. In phase \textbf{II}, the failure rate is constant; the corresponding failures are known as random failures, and this phase is also called \textit{Useful Lifetime}. A burn-in process is therefore used to artificially age the product, such that it enters the market already in phase II. The third phase \textbf{III} shows the \textit{Wear-Out} of the product, where the failure rate increases due to aging effects.
%
\begin{figure}
\centering
\includegraphics[width=\linewidth]{bathtube.pdf}
\caption{Hardware Failure Bathtub Curve~\cite{iso26262}}
\label{fig:bathtube}
\end{figure}
\newpage
For the analysis of hardware failures, ISO\,26262 assumes that the hardware is used during its useful lifetime and that the failure rate $\lambda(t)$ is constant (see ISO\,26262-11~\cite{iso26262}).
%
For constant failure rates, we can assume the exponential failure distribution
$$F(t) = 1 - e^{-\lambda\cdot t}$$

The constant failure rate of this distribution $\lambda$ is measured in \textit{Failure in Time} (FIT), where 1~FIT represents one failure in $10^9$ hours, which is approximately one failure in 114,080 years. For the hardware metrics, ISO\,26262 distinguishes several different failure rates, among them:
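The quoted figure follows directly from the definition; as a quick sanity check:

```latex
$$\frac{10^9\,\mathrm{h}}{24\,\mathrm{h/day} \cdot 365.25\,\mathrm{days/year}} \approx 114{,}080\ \mathrm{years}$$
```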

\begin{itemize}
\item $\lambda_\mathrm{SPF}$ \textit{Single-Point Fault Failure Rate}: Considers faults that are not covered by any safety mechanism and immediately lead to the violation of a safety goal.
\item $\lambda_\mathrm{RF}$ \textit{Residual Fault Failure Rate}: Considers faults for which a safety mechanism is implemented but which are not controlled by the safety mechanism and lead to the violation of a safety goal.
\item $\lambda_\mathrm{MPF}$ \textit{Multi-Point Fault Failure Rate}: Considers several independent faults, which in combination lead to the violation of a safety goal. For this paper, especially the latent faults $\lambda_\mathrm{MPF,L}$ are important, whose presence is neither detected by a safety mechanism nor perceived by the driver.
\item $\lambda_\mathrm{S}$ \textit{Safe Fault Failure Rate}: Considers faults that do not have any significant influence on the violation of a safety goal.
\end{itemize}

\noindent The total failure rate is the sum of the above failure rates:
$$\lambda = \lambda_\mathrm{SPF} + \lambda_\mathrm{RF} + \lambda_\mathrm{MPF} + \lambda_\mathrm{S}$$
%

The ISO\,26262 furthermore specifies the hardware metrics used to evaluate the risk posed by hardware elements:
\begin{description}
\item[Single-Point Fault Metric (SPFM):] This metric reflects the coverage of a hardware element with respect to single-point faults, either by design or by coverage via safety mechanisms.
$$ \mathrm{SPFM} = 1 - \frac{\sum \left( \lambda_\mathrm{SPF} + \lambda_\mathrm{RF}\right)}{\sum \lambda}$$
\item[Latent Fault Metric (LFM):]
This metric reflects the coverage of a hardware element with respect to latent faults, either by design (primarily safe faults), by fault coverage via safety mechanisms, or by the driver's recognition of a fault's existence within the fault-tolerant time interval of a safety goal.
$$ \mathrm{LFM} = 1 - \frac{\sum \lambda_\mathrm{MPF,L}}{\sum \left(\lambda - \lambda_\mathrm{SPF} - \lambda_\mathrm{RF}\right)}$$
\end{description}
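As an illustration with hypothetical numbers (not taken from the case study): assume a total failure rate of $\sum\lambda = 1000$\,FIT, of which $\sum(\lambda_\mathrm{SPF}+\lambda_\mathrm{RF}) = 5$\,FIT and $\sum\lambda_\mathrm{MPF,L} = 50$\,FIT. Then

```latex
$$\mathrm{SPFM} = 1 - \frac{5}{1000} = 99.5\,\%, \qquad \mathrm{LFM} = 1 - \frac{50}{1000 - 5} \approx 95.0\,\%$$
```

Together with the residual rate of $5$\,FIT ($< 10$\,FIT), these values meet the thresholds for ASIL\,D in Table~\ref{tab:target}.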
\begin{table}[t]
\centering
\begin{tblr}{cccc}
\hline
\textbf{ASIL} & \textbf{SPFM} & \textbf{LFM} & \textbf{Residual FIT} \\ \hline
\textbf{A} & - & - & $< 1000$ \\
\textbf{B} & $> 90\%$ & $> 60\%$ & $< 100 $ \\
\textbf{C} & $> 97\%$ & $> 80\%$ & $< 100 $ \\
\textbf{D} & $> 99\%$ & $> 90\%$ & $< 10 $ \\
\hline
\end{tblr}
\vspace{10pt}
\caption{Requirements according to ISO\,26262~\cite{iso26262}}
\label{tab:target}
\end{table}

Table~\ref{tab:target} shows the required target values for $\lambda_\mathrm{RF}$, SPFM, and LFM to reach a specific ASIL. For example, the highest level, ASIL\,D, can only be reached if the SPFM is greater than 99\%, the LFM is greater than 90\%, and the residual failure rate is below 10\,FIT.

\section{Related Work}
\label{sec:related}
%
This section discusses related work and the state of the art.
Today's safety standards, such as ISO\,26262, usually recommend a \textit{Failure Mode and Effect Analysis} (FMEA) or a \textit{Fault Tree Analysis} (FTA). The FMEA is performed by creating spreadsheets. In complex systems, these spreadsheets grow extensively, multi-level hierarchies are traditionally not supported, and it is very hard for safety engineers to maintain and adapt them during change management.
A much more structured approach is the FTA. Especially the \textit{Component Fault Trees}~(CFT) introduced by Adler et\,al.~\cite{adldom_11} provide several structural advantages compared to spreadsheet approaches. These fault trees allow a clear structuring of the model and fully capture the hardware's complexity. However, with these CFTs, it is not easy to derive the actual metrics required by ISO\,26262 for the ASIL rating because a \textit{Minimal Cut Set} (MCS) analysis has to be performed, which is a complex additional step. Furthermore, because of this additional step, the improvement of the ASIL resulting from the introduction of new safety measures is not directly visible. Another shortcoming is the necessary translation of failure rates into failure probabilities, which are used in the FTA, and back again into failure rates to calculate the hardware metrics.
Adler et\,al.~\cite{adldom_11} already introduced the modeling element \textit{Measure}, which allows making a distinction between faults and failing safety measures and, in principle, makes it possible to retrieve the hardware metrics from the MCS analysis. However, the authors leave open whether the \textit{Measure} element shall be used to express classical failures of the mechanism (expressed by a failure rate) or insufficiencies by design (expressed as diagnostic coverage in percent). The distinction issue that the authors observed for modeling measures also applies to splits in the failure propagation due to the system structure. To model this with CFTs, \textit{basic events} with constant probability combined with an \textit{AND-gate} could be used, but the split is then treated as a failure. Another option would be to separate the initial fault into several basic events for every split during the failure propagation through the system, which leads to a high modeling effort in complex systems. Our method overcomes these disadvantages because the metrics can be derived directly and a clear one-to-one mapping between metrics and the hardware exists, which enables easy optimization by introducing new safety measures.

The usage of SystemC-based virtual prototypes for safety analysis is already well established. However, all these approaches focus on simulating the functionality and injecting errors. For example, in~\cite{reipre_13}, the authors present how virtual prototypes can support the FMEA process. There also exist other works whose main focus is on fault injection during functional simulations~\cite{weisch_16,tabcha_16,silpar_14,tab_19}. As mentioned above, all of these previous works focus on functional simulation and error injection for ISO\,26262 support. The focus of our work lies on the static hardware metrics analysis of ISO\,26262 and how it can be realized within SystemC.
%
%
\include{blocks-safety-methodology}
%
\section{Methodology}
\label{sec:method}
In the following, we describe our new methodology for estimating the hardware metrics required by ISO\,26262. Similar to CFTs, our methodology is object-oriented, i.e., it models the system with the hardware components that also exist in reality. The safety behavior of each component is modeled in the component itself by using five central building blocks, which are shown in Figure~\ref{fig:blocks} and explained below.

The \textit{Basic Event} block represents internal faults with a specific failure rate~$\lambda_\mathrm{BE}$.
The \textit{Sum} block receives the failure rates $\lambda_0$ to $\lambda_n$ and computes the sum of these failure rates.

The \textit{Coverage} block can be used to model the \textit{Diagnostic Coverage}~(DC) of a specific safety measure. The input failure rate $\lambda_\mathrm{in}$ is reduced by the DC rate $c$ of this safety measure:
%
$$\lambda_\mathrm{RF}=\lambda_\mathrm{in}\cdot(1-c)$$
%
For instance, if $\lambda_\mathrm{in}=100\,$FIT and $c=0.95$, only 5\% of the failures, i.e., $5\,$FIT, are propagated.
According to ISO\,26262, the covered FITs must be added to the latent failures $\lambda_\mathrm{MPF,L}$ to consider the scenario where the safety measure itself is defective:
%
$$\lambda_\mathrm{MPF,L}=\lambda_\mathrm{in}\cdot c$$
%
In our example, $95\,$FIT are propagated to the latent fault metric if no other measure reduces these failures.

The \textit{Split} block distributes the incoming failure rate to an arbitrary number of output ports according to specific rates $p_i$, where the condition
%
$$\sum_{i=0}^{n} p_i \leq 1$$
%
must hold; otherwise, new failures would be created out of nowhere. It is possible for some parts of the incoming failure rate to vanish completely because of the split, i.e., they are not propagated. These faults are called \textit{Safe Faults} because they will never lead to a safety goal violation. The safe fault failure rate can therefore be described as:

$$\lambda_\mathrm{S} = \lambda_\mathrm{in} \cdot \left(1-\sum_{i=0}^{n} p_i \right)$$

In summary, the \textit{Split} block is used to model failure distributions caused by the system structure, e.g., when a data stream is divided, or when the safety mechanism adds additional errors during the correction of unsupported faults, such as double-bit errors in a single-error correction mechanism.

The last required block is the \textit{ASIL} block, which calculates the ASIL from $\lambda_\mathrm{SPF}$, $\lambda_\mathrm{RF}$, and $\lambda_\mathrm{MPF,L}$ within the entire system. This block implements the logic of Table~\ref{tab:target}.

With these five blocks, it is possible to model the safety behavior of hardware in compliance with ISO\,26262. We would like to mention here that it is only necessary to consider faults of safety-related components. Components that are not safety-related do not have to be modeled at all, or their errors do not need to be modeled or connected (and are thus not considered in the sum of all errors).
In Section~\ref{sec:study}, we will present the modeling of a real-world automotive memory system in order to illustrate the interaction of these blocks.
%
%
\section{Implementation}
\label{sec:implementation}
%
%
In this section, we describe the implementation of the building blocks in SystemC, which is well established and a de-facto industry standard. Therefore, there already exist a lot of functional simulation models that can be enhanced with our safety methodology. SystemC offers the right infrastructure by providing the concepts of modules, ports, and signals that are required for our basic blocks. Unlike graphical safety tools, it also offers programmability, and repetitions can be handled by loops. Furthermore, SystemC's port check is very helpful in the development phase of the safety model, since it complains about unbound ports at the beginning of a simulation. The failure rates are propagated by using a classical \texttt{sc\_signal<double>}.
For all blocks, we use the dynamic binding of SystemC for the sake of convenience. All blocks contain classical \texttt{SC\_METHOD}s, i.e., during the first delta cycles of the SystemC simulation, all hardware safety metrics are already computed; they are printed out at the end of a simulation.

The first block is the \textit{Basic Event} block shown in Listing~\ref{listing:basic_event}, which receives the failure rate (\texttt{rate}) as a constructor parameter and propagates this value to its output port.
\begin{listing}[!ht]
\begin{minted}[
    bgcolor=LightGray,
    fontsize=\footnotesize,
    linenos
]{c++}
SC_MODULE(basic_event) {
    sc_out<double> output;
    double rate;

    SC_HAS_PROCESS(basic_event);
    basic_event(sc_module_name name, double rate) : output("output"),
                                                    rate(rate)
    {
        SC_METHOD(compute_fit);
    }

    void compute_fit() {
        output.write(rate);
    }
};
\end{minted}
\caption{Implementation of the Basic Event Block in SystemC}
\label{listing:basic_event}
\end{listing}
%
The \textit{Sum} block has a dynamic input port array and a single output port. In its computation method, it calculates the sum of the incoming failure rates on all input ports, as shown in Listing~\ref{listing:sum}.
%
\begin{listing}[!ht]
\begin{minted}[
    bgcolor=LightGray,
    fontsize=\footnotesize,
    linenos
]{c++}
SC_MODULE(sum) {
    sc_port<sc_signal_in_if<double>, 0, SC_ONE_OR_MORE_BOUND> inputs;
    sc_out<double> output;

    SC_CTOR(sum) : output("output") {
        SC_METHOD(compute_fit);
        sensitive << inputs;
    }

    void compute_fit() {
        double sum = 0.0;
        for(int i = 0; i < inputs.size(); i++) {
            sum += inputs[i]->read();
        }
        output.write(sum);
    }
};
\end{minted}
\caption{Implementation of the Sum Block in SystemC}
\label{listing:sum}
\end{listing}
%
The \textit{Coverage} block, shown in Listing~\ref{listing:coverage}, receives the DC as a constructor parameter and calculates $\lambda_\mathrm{RF}$ (\texttt{output}) and $\lambda_\mathrm{MPF,L}$ (\texttt{latent}) according to the formulas presented in Section~\ref{sec:method}.
%
\begin{listing}[!ht]
\begin{minted}[
    bgcolor=LightGray,
    fontsize=\footnotesize,
    linenos
]{c++}
SC_MODULE(coverage) {
    sc_in<double> input;
    sc_out<double> output;
    sc_port<sc_signal_inout_if<double>, 0, SC_ZERO_OR_MORE_BOUND> latent;

    double dc;

    SC_HAS_PROCESS(coverage);
    coverage(sc_module_name name, double dc) : input("input"),
                                               output("output"),
                                               dc(dc)
    {
        SC_METHOD(compute_fit);
        sensitive << input;
    }

    void compute_fit() {
        output.write(input.read() * (1 - dc));
        if(latent.bind_count() != 0) {
            latent->write(input.read() * dc);
        }
    }
};
\end{minted}
\caption{Implementation of the Coverage Block in SystemC}
\label{listing:coverage}
\end{listing}
%
Compared to the other blocks, the implementation of the \textit{Split} block is more complicated. Since we want to support dynamic binding and direct assignment of the failure distribution rate, we derived a custom port class \texttt{sc\_split\_out} from \texttt{sc\_port} that overrides the \texttt{bind} methods in order to allow specifying the split rate directly with the dynamic binding, as shown in Listing~\ref{listing:port}.
%
\begin{listing}[!ht]
\begin{minted}[
    bgcolor=LightGray,
    fontsize=\footnotesize,
    linenos
]{c++}
template <class T>
class sc_split_out :
    public sc_port<sc_signal_inout_if<T>, 0, SC_ONE_OR_MORE_BOUND>
{
public:
    std::vector<double> split_rates;

    void bind(sc_interface& interface, double rate) {
        sc_port_base::bind(interface);
        split_rates.push_back(rate);
    }

    void bind(sc_out<double>& parent, double rate) {
        sc_port_base::bind(parent);
        split_rates.push_back(rate);
    }
};
\end{minted}
\caption{Implementation of the Custom Split Port in SystemC}
\label{listing:port}
\end{listing}
%
The actual implementation of the \textit{Split} component is shown in Listing~\ref{listing:split}. It receives a failure rate as input and distributes it to the output ports according to the assigned split rates.
%
\begin{listing}[!ht]
\begin{minted}[
    bgcolor=LightGray,
    fontsize=\footnotesize,
    linenos
]{c++}
SC_MODULE(split) {
    sc_in<double> input;
    sc_split_out<double> outputs;

    SC_CTOR(split) : input("input") {
        SC_METHOD(compute_fit);
        sensitive << input;
    }

    void compute_fit() {
        for(int i = 0; i < outputs.size(); i++) {
            double rate = outputs.split_rates.at(i);
            outputs[i]->write(input.read() * rate);
        }
    }
};
\end{minted}
\caption{Implementation of the Split Block in SystemC}
\label{listing:split}
\end{listing}

The last building block is the \textit{ASIL} block, which estimates the ASIL of the system according to Table~\ref{tab:target}. It receives the single-point and residual failure rates $\lambda_\mathrm{SPF} + \lambda_\mathrm{RF}$ and the latent failure rates $\lambda_\mathrm{MPF,L}$ as inputs, and the total failure rate $\lambda$ as a constructor parameter. From these values, it calculates the ASIL and prints it from the destructor call at the end of the simulation, as shown in Listing~\ref{listing:asil}.
%
\begin{listing}[!ht]
\begin{minted}[
    bgcolor=LightGray,
    fontsize=\footnotesize,
    linenos
]{c++}
SC_MODULE(asil) {
    sc_in<double> residual;
    sc_in<double> latent;

    double spfm;
    double lfm;
    std::string asil_level;
    double total;

    SC_HAS_PROCESS(asil);
    asil(sc_module_name name, double total) : total(total) {
        SC_METHOD(compute);
        sensitive << residual << latent;
    }

    void compute() {
        spfm = 100 * (1 - (residual / total));
        lfm = 100 * (1 - (latent / (total - residual)));
        asil_level = "QM";

        if (residual < 1000.0) {
            asil_level = "ASIL A";
        }
        if (spfm > 90.0 && lfm > 60.0 && residual < 100.0) {
            asil_level = "ASIL B";
        }
        if (spfm > 97.0 && lfm > 80.0 && residual < 100.0) {
            asil_level = "ASIL C";
        }
        if (spfm > 99.0 && lfm > 90.0 && residual < 10.0) {
            asil_level = "ASIL D";
        }
    }

    ~asil() {
        // Print out of the estimated ASIL level ...
    }
};
\end{minted}
\caption{Implementation of the ASIL Block in SystemC}
\label{listing:asil}
\end{listing}
|
||
%
|
||
\newpage
|
||
%
|
||
\section{\new{Case Study with LPDDR5}}
\label{sec:study}
%
\begin{figure}
\centering
\begin{circuitikz}
\useasboundingbox (-5.5,-5.5) rectangle (5.5,5.5);
\draw[blue] (0,0) node[qfpchip, num pins=16, hide numbers, no topmark, external pins width=0](C){SoC};
\draw[blue] ( 0, 4) node[qfpchip, num pins=16, hide numbers, no topmark, external pins width=0](D1){LPDDR5};
\draw[blue] ( 0,-4) node[qfpchip, num pins=16, hide numbers, no topmark, external pins width=0](D2){LPDDR5};
\draw[blue] (-4, 0) node[qfpchip, num pins=16, hide numbers, no topmark, external pins width=0](D3){LPDDR5};
\draw[blue] ( 4, 0) node[qfpchip, num pins=16, hide numbers, no topmark, external pins width=0](D4){LPDDR5};
\draw[blue] (C.bpin 16) to [multiwire=16] (D1.bpin 5);
\draw[blue] (C.bpin 15) to [multiwire] (D1.bpin 6);
\draw[blue] (C.bpin 14) to [multiwire] (D1.bpin 7);
\draw[blue] (C.bpin 13) to [multiwire] (D1.bpin 8);
\draw[blue] (C.bpin 12) to [multiwire=16] (D4.bpin 1);
\draw[blue] (C.bpin 11) to [multiwire] (D4.bpin 2);
\draw[blue] (C.bpin 10) to [multiwire] (D4.bpin 3);
\draw[blue] (C.bpin 9) to [multiwire] (D4.bpin 4);
\draw[blue] (C.bpin 8) to [multiwire=16] (D2.bpin 13);
\draw[blue] (C.bpin 7) to [multiwire] (D2.bpin 14);
\draw[blue] (C.bpin 6) to [multiwire] (D2.bpin 15);
\draw[blue] (C.bpin 5) to [multiwire] (D2.bpin 16);
\draw[blue] (C.bpin 4) to [multiwire=16] (D3.bpin 9);
\draw[blue] (C.bpin 3) to [multiwire] (D3.bpin 10);
\draw[blue] (C.bpin 2) to [multiwire] (D3.bpin 11);
\draw[blue] (C.bpin 1) to [multiwire] (D3.bpin 12);
\end{circuitikz}

\caption{\new{Memory Architecture of an Automotive SoC similar to ORIN~\cite{kar_22}}}
\label{fig:memory_architecture}
\end{figure}
%
\new{
In the original conference paper~\cite{uecjun_22} we modeled the automotive LPDDR4 DRAM architecture presented in~\cite{stekra_21}.
To show the scalability of our approach, in this work we model a more complex and more recent LPDDR5 memory system, which is similar to NVIDIA's Orin platform~\cite{kar_22}.
Compared to LPDDR4, LPDDR5 introduces a new \textit{Link Error Correction Code} (Link ECC) feature to counteract the higher number of interface errors caused by the increased data transfer rates.}

\new{Figure~\ref{fig:memory_architecture} shows the system architecture, which consists of a high-performance \textit{System on Chip} (SoC) and four LPDDR5 devices.
Each memory device comprises four independent channels, which results in a total of 16 channels with a theoretical maximum bandwidth of \qty{204.8}{\giga\byte\per\second} for the entire control unit.
To further increase data reliability, the platform is equipped with an in-line ECC mechanism.
Unlike in \textit{Dual In-line Memory Module} (DIMM) based memory systems (e.g., DDR4 or DDR5), where the redundancy is usually stored in an additional, dedicated device (so-called side-band ECC), in-line ECC stores the redundancy in the same device as the user data.
%Unlike DDR4 and DDR5 systems, where each \textit{Dual In-line Memory Module} (DIMM) has a dedicated, additional memory device to store the ECC data (so-called side-band ECC), the modeled platform does not have an additional ECC device.
%For such platforms, it is common to store the ECC data in-line and cache the recently accessed ECC data in the SoC.
The disadvantage of this approach is its impact on the DRAM performance (bandwidth and latency): in the worst case, each user data access has to be accompanied by an additional DRAM access to fetch the parity bits, while with side-band ECC the parity bits are fetched in parallel with the user data.
To estimate the performance overhead of the used in-line ECC technique, we model the platform's DRAM subsystem within the DRAM simulator DRAMSys~\cite{junwei_15,stejun_20} in Section~\ref{sec:vp}.}
%
%
\subsection{\new{LPDDR5 Safety Model}}
\label{sec:safety-model}
\new{Figure~\ref{fig:model} shows the safety model of this architecture realized with our new methodology.
Since all 16 channels are independent, in the following we model only a single LPDDR5 channel for the safety analysis.
Most of the errors in the model originate in the DRAM array and on the DRAM bus.
Following \cite{buc_20}, \cite{boe_21}, and \cite{stekra_21}, we modeled the four main errors that may occur in the DRAM array:
\textit{Single-Bit Errors} (SBE), \textit{Double-Bit Errors} (DBE), \textit{Multi-Bit Errors} (MBE), and \textit{Wrong Data} (WD).
The exact distribution of these errors and their failure rates was obtained from \textit{Scenario 1} in \cite{stekra_21}.
As shown in Figure~\ref{fig:model}, these errors propagate upwards in the system to the next component, the internal LPDDR5 \textit{Single Error Correction} (SEC), which uses a $(136,128)$ Hamming ECC.}
%
\new{This SEC ECC is a safety mechanism that can correct 100\,\% of all single-bit errors. Therefore, all SBEs are fully covered, reducing the residual failure rate $\lambda_\mathrm{RF}$ for SBEs to zero. This is modeled using the \textit{Coverage} block. However, if this SEC ECC safety mechanism is defective, the covered failure rate must be added to the latent SBE failure rate $\lambda_\mathrm{MPF,L}$, which propagates to the next component. Additionally, the failure rate of the SEC ECC itself must be added to the latent failure rate. Therefore, we modeled an additional \textit{Basic Event} called \textit{SEC ECC Broken} (SB).}

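The bookkeeping performed by the \textit{Coverage} block can be illustrated with a minimal plain-C++ sketch (the function and field names below are ours, not part of the library): the covered share of the incoming failure rate is removed from the residual output and accounted as latent, together with the failure rate of the safety mechanism itself.

```cpp
#include <cassert>

// Minimal sketch of the Coverage block's failure-rate bookkeeping
// (hypothetical helper names, not the library API).
struct CoverageResult {
    double residual; // residual failure rate propagated downstream (FIT)
    double latent;   // latent failure rate propagated downstream (FIT)
};

// lambda_in : incoming failure rate (FIT)
// dc        : diagnostic coverage of the safety mechanism (1.0 for the SEC ECC and SBEs)
// lambda_sm : failure rate of the safety mechanism itself (FIT), e.g. "SEC ECC Broken"
CoverageResult coverage(double lambda_in, double dc, double lambda_sm) {
    CoverageResult out;
    out.residual = (1.0 - dc) * lambda_in;     // share the mechanism does not cover
    out.latent   = dc * lambda_in + lambda_sm; // covered share becomes latent if the
                                               // mechanism is defective, plus the
                                               // mechanism's own failure rate
    return out;
}
```

For the SEC ECC and SBEs, the diagnostic coverage is $1$, so the residual SBE rate becomes zero while the full rate is accounted as latent.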
\new{In the case of an incoming DBE, two scenarios have to be differentiated. First, if there is a defect in the SEC engine, the DBE will remain a DBE. Second, if there is no defect in the SEC engine, it will either detect that there is an uncorrectable error or attempt to correct the data, thereby introducing a third error. The probability of introducing a third error largely depends on the specific code that is used. According to~\cite{davkap_81,stekra_21}, 83\,\% of the DBEs remain DBEs, while a third bit error (TBE) is introduced in 17\,\% of the cases. To model this behavior, a \textit{Split} component is used, which distributes the incoming DBE failure rate to the outgoing DBE and TBE failure rates, respectively.
In the case of an incoming MBE or WD, the SEC engine is not able to correct any bit errors. Thus, these failure rates are always propagated.}

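The arithmetic of the \textit{Split} component for this case is simple; the following plain-C++ sketch (hypothetical helper names) distributes an incoming DBE failure rate according to the 83/17 ratio stated above:

```cpp
#include <cassert>
#include <cmath>

// Sketch of the Split component's distribution of an incoming DBE failure
// rate (hypothetical helper names; the 83/17 ratio is taken from the text).
struct SplitResult {
    double dbe; // rate of errors that remain double-bit errors (FIT)
    double tbe; // rate of miscorrections that introduce a third error (FIT)
};

SplitResult split_dbe(double lambda_dbe_in) {
    SplitResult out;
    out.dbe = 0.83 * lambda_dbe_in; // 83 % remain DBEs
    out.tbe = 0.17 * lambda_dbe_in; // 17 % turn into TBEs
    return out;
}
```

By construction, the two outgoing rates always sum up to the incoming rate, so no failure rate is lost or duplicated by the split.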
\new{Compared to LPDDR4, LPDDR5 supports higher data transfer rates on the bus interface, which, in turn, lead to higher bit error rates. For that reason, LPDDR5 introduces a link ECC mechanism, which uses a \textit{Single Error Correction Double Error Detection} (SECDED) code in the form of a $(137,128)$ Hamming ECC. Therefore, we analyzed the FIT rates of a typical LPDDR5 interface. According to JEDEC, the interface must achieve a \textit{Bit Error Rate} (BER) of at most $10^{-16}$ for a single DRAM pin. As there are 16 data pins for each DRAM device, we can compute the probability of multi-bit errors with
\[ p(e) = \binom{n}{e} \cdot \mathrm{BER}^e \cdot \left(1-\mathrm{BER}\right)^{n-e},\]
where $e$ is the number of errors and $n$ is the number of data pins.}

\new{Since ISO~26262 requires FIT rates for the safety analysis, the probabilities have to be converted. This can be achieved by computing
\[\lambda_\mathrm{Link}(e) = p(e) \cdot DR \cdot n \cdot \qty{3600}{\second\per\hour} \cdot 10^{9},\]
where $DR$ is the data transfer rate of the memory, in our case \qty{6400}{\mega\transfer\per\second}, the factor \qty{3600}{\second\per\hour} converts the per-second rate into a per-hour rate, and the factor $10^{9}$ expresses the result in FIT. Table~\ref{tab:bus-errors} shows the FIT rates for SBE, DBE, and MBE, where MBE is computed as
\[ \mathrm{MBE} = \sum_{e=3}^{16} \lambda_\mathrm{Link}(e).\]
It is important to highlight that the SBE rate is very large, while the DBE and MBE rates are negligible, i.e., with a BER of $10^{-16}$ it is very unlikely that a DBE or MBE will occur. For this reason, Figure~\ref{fig:model} does not include the DBE and MBE bus errors. This clearly shows the necessity of a SECDED link ECC for high-frequency systems to ensure that SBEs are detected and corrected.}

\begin{table}[t]
\centering
\new{\begin{tblr}{lcc}
\hline
\textbf{Number of Errors ($e$)} & \textbf{$p(e)$} & \textbf{$\lambda_\mathrm{Link}(e)$ [FIT]} \\\hline
1 (SBE) & $1.599\cdot10^{-15}$ & $5.898\cdot10^{8}$ \\
2 (DBE) & $1.200\cdot10^{-30}$ & $4.423\cdot10^{-7}$ \\
3--16 (MBE) & $5.6\cdot10^{-46}$ & $2.06\cdot10^{-22}$ \\
\hline
\end{tblr}}
\vspace{10pt}
\caption{\new{Bus Failure Rates}}
\label{tab:bus-errors}
\end{table}
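The conversion from bit error probabilities to FIT rates can be cross-checked with a few lines of plain C++ (the helper names are ours; the factor 3600 converts per-second rates into per-hour rates, and the factor $10^{9}$ expresses the result in FIT). The computed values reproduce Table~\ref{tab:bus-errors}:

```cpp
#include <cassert>
#include <cmath>

// Binomial coefficient C(n, e), computed iteratively in double precision.
double binom(int n, int e) {
    double c = 1.0;
    for (int i = 0; i < e; ++i) c = c * (n - i) / (i + 1);
    return c;
}

// Probability that exactly e of the n data pins are erroneous.
double p_error(int e, int n = 16, double ber = 1e-16) {
    return binom(n, e) * std::pow(ber, e) * std::pow(1.0 - ber, n - e);
}

// Corresponding failure rate in FIT; dr is the data transfer rate in
// transfers per second (6400 MT/s), 3600 converts seconds to hours and
// 1e9 scales the result to failures per 10^9 hours.
double lambda_link(int e, int n = 16, double dr = 6.4e9) {
    return p_error(e, n) * dr * n * 3600.0 * 1e9;
}
```

For $e=1$ this yields approximately $5.898\cdot10^{8}$\,FIT, for $e=2$ approximately $4.42\cdot10^{-7}$\,FIT, and the sum over $e=3,\dots,16$ is approximately $2.06\cdot10^{-22}$\,FIT, matching the table.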

\begin{figure}

\centering
\newcommand\width{10}
\begin{circuitikz}

\foreach \x in {0,...,63}{
\ifthenelse{\(\x<16\)\OR\(\x>31\AND\x<48\)}{
\newcommand\farbe{gray!20}
}{
\ifthenelse{\x>55}{
\ifthenelse{\(\x>55\)\AND\(\x<60\)}{
\newcommand\farbe{red}
}{
\newcommand\farbe{gray}
}
}{
\newcommand\farbe{white}
}
}
\ifthenelse{\x=59}{
\fill[fill=red] (\x*\width*0.015625, 0.5) rectangle ++(\width*0.015625*0.5, -0.5) {} coordinate(c1);
\fill[fill=gray](c1) rectangle ++(\width*0.015625*0.5, 0.5) {};
\node[fit={(\x*\width*0.015625,0)(\x*\width*0.015625+\width*0.015625,0.5)}, inner sep=0pt, draw=black] (rec\x) {};
}{
\node[fit={(\x*\width*0.015625,0)(\x*\width*0.015625+\width*0.015625,0.5)}, inner sep=0pt, draw=black, fill=\farbe] (rec\x) {};
}
}

\draw(rec15.south) to [open] ++(0,-0.15) coordinate(e1);
\draw[red, thick](rec0.south) ++(0,-0.15) to [short, name=s1] (e1);
\draw[red](s1.center) -- ++(0,-0.25) -| (rec56);

\draw(rec31.south) to [open] ++(0,-0.15) coordinate(e2);
\draw[red, thick](rec16.south) ++(0,-0.15) to [short, name=s2] (e2);
\draw[red](s2.center) -- ++(0,-0.50) -| (rec57);

\draw(rec47.south) to [open] ++(0,-0.15) coordinate(e3);
\draw[red, thick](rec32.south) ++(0,-0.15) to [short, name=s3] (e3);
\draw[red](s3.center) -- ++(0,-0.75) -| (rec58);

\draw(rec55.south) to [open] ++(0,-0.15) coordinate(e4);
\draw[red, thick](rec48.south) ++(0,-0.15) to [short, name=s4] (e4);
\draw[red](s4.center) -- ++(0,-1.00) -| (rec59);

\draw[red] (rec57) ++ (0,0.5) node[]{ECC};

\draw[thick] (0,-1.5) rectangle ++(\width,3);
\draw(0,1.75) node[right]{DRAM Bank};
\draw(0,0.75) node[right]{DRAM Row};

\end{circuitikz}
\caption{In-Line ECC in a Single DRAM Bank}
\label{fig:in-line}
\end{figure}

\new{As mentioned before, the memory controller in our automotive platform uses an in-line ECC mechanism, with the redundancy stored in the same device as the user data.
Figure~\ref{fig:in-line} shows a typical in-line ECC mapping, where the parity bits are stored at the end of the corresponding DRAM row.
In addition, the figure shows which user data accesses are covered by which ECC accesses.
Each box corresponds to a single DRAM access (\qty{256}{\bit}, i.e., \qty{32}{\byte}).
Since the corresponding parity bits are stored in the same row, the additional ECC DRAM access will not result in a row miss, which is beneficial for the performance.
In Section~\ref{sec:results}, we will discuss the performance overhead for the best-case and worst-case scenarios.
The ECC used here is again a Hamming SECDED $(272,256)$ code with 16 bits of redundancy per DRAM access.
%
Since the redundancy is not stored in an additional chip as in DIMM-based systems, the effective memory size is reduced by \qty{12.5}{\percent}. Moreover, as shown in Figure~\ref{fig:in-line}, the equivalent of 4.5 DRAM accesses per row is currently not used. This unused area could host additional safety measures and more powerful ECC algorithms in the future.}

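The storage figures above follow directly from the row layout in Figure~\ref{fig:in-line} and can be verified with a few constants (values as stated in the text):

```cpp
#include <cassert>

// Row layout constants from the text: 64 DRAM accesses of 256 bit per row,
// 56 of them user data, and 16 parity bits per protected 256-bit access.
constexpr int accesses_per_row = 64;
constexpr int data_accesses    = 56;
constexpr int bits_per_access  = 256;
constexpr int parity_bits      = 16;

// Effective capacity reduction: 8 of the 64 accesses are not user data.
constexpr double storage_overhead =
    1.0 - static_cast<double>(data_accesses) / accesses_per_row;

// Parity for one full row, expressed in 256-bit access units.
constexpr double ecc_accesses =
    static_cast<double>(data_accesses * parity_bits) / bits_per_access;

// Remaining unused accesses per row.
constexpr double unused_accesses =
    accesses_per_row - data_accesses - ecc_accesses;
```

This reproduces the \qty{12.5}{\percent} capacity reduction, 3.5 accesses' worth of parity, and 4.5 unused accesses per row.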
\new{There exist further components in the model shown in Figure~\ref{fig:model}. For instance, the DRAM-TRIM component removes the redundancy of the code, possibly also reducing the number of data errors. For further explanations of these components, we refer to~\cite{stekra_21} and the previous conference paper~\cite{uecjun_22}.}
%
\subsection{\new{LPDDR5 Performance Model}}
\label{sec:vp}
\new{In order to estimate the overhead of the in-line ECC mechanism, we integrated our safety model into the DRAM design space exploration framework \mbox{DRAMSys~\cite{junwei_15,stejun_20}}.
In DRAMSys we modeled the same architecture as shown in Figure~\ref{fig:memory_architecture}.
We used traffic generators to stimulate the memory system with a \textit{sequential} and a \textit{random} access pattern.
In order to generate the required additional ECC requests, a new module is inserted between a regular traffic generator and the DRAM subsystem.
This module keeps track of the currently fetched parity bits for all banks.
When new parity bits are required, an additional ECC request is performed before the initiating request is issued to the DRAM.
For each bank, the module can hold the data of four ECC requests (the redundancy for one complete row) at once.}

\new{Additionally, the addresses of all requests are offset by an incrementing amount to accommodate the ECC memory regions and the unused space in each DRAM row, as shown in Figure~\ref{fig:in-line}. The offset is derived from the following equations, where $R$ is the original row, $R'$ the new offset row, $C$ the original column, and $C'$ the offset column:
%
\[
C'=\left(R\cdot 256+C\right)~\mathrm{mod}~1792
\]}

\new{\[
R'=\left\lfloor\frac{R\cdot 256+C}{1792}\right\rfloor+R
\]}

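The two equations above translate directly into a small address-remapping helper (a sketch with our own names; the constants 256 and 1792 are taken from the equations):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the address offsetting from the two equations above
// (hypothetical helper names, not the simulator's API).
struct RowCol {
    std::uint64_t row;
    std::uint64_t col;
};

RowCol offset_address(std::uint64_t r, std::uint64_t c) {
    const std::uint64_t lin = r * 256 + c; // linearized original address
    RowCol out;
    out.col = lin % 1792;     // C' = (R*256 + C) mod 1792
    out.row = lin / 1792 + r; // R' = floor((R*256 + C) / 1792) + R
    return out;
}
```

Whenever the linearized address crosses a multiple of 1792, the request is pushed into the next row, which realizes the incrementing offset described in the text.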
\begin{figure}[p]
\centering
\input{model}
\caption{\new{Safety Model of a Single LPDDR5 Channel}}
\label{fig:model}
\end{figure}

\begin{figure}[p]
\centering
\input{result1}
\caption{Absolute (LPDDR5)}
\label{fig:absolute}
\end{figure}
\begin{figure}[p]
\centering
\input{result2}
\caption{Relative (LPDDR5)}
\label{fig:relative}
\end{figure}

\section{Experimental Results}
\label{sec:results}
In this section, we first discuss the results of the safety analysis and then the results of the performance analysis.
\subsection{Safety Analysis}
Unlike the state of the art, where only a single DRAM failure rate is analyzed, we go one step further with our analysis. Because we are using SystemC, we can easily compute many different scenarios in parallel. In order to analyze the safety behavior of the provided DRAM system and the ECC safety measures, we sweep the DRAM's failure rate $\lambda_\mathrm{DRAM}$ from 1\,FIT to 2500\,FIT in several simulations.
For this simulation, we assume that only the DRAM-related hardware components influence the safety goal under consideration, and leave out other hardware elements on the SoC, which were considered in~\cite{stekra_21}. In practice, failure rate budgets are distributed to the involved hardware elements.
In this case, as shown in Figure~\ref{fig:absolute}, we could reach the requirement for ASIL\,D ($< 10$\,FIT) if the DRAM's failure rate stays below 53\,FIT.
However, if we take a look at the relative metrics shown in Figure~\ref{fig:relative}, we can see that, with a value of 81\,\%, the SPFM is far from the ASIL\,D threshold of 99\,\%. We cannot even reach ASIL\,B, which has an SPFM threshold of 90\,\%. From the LFM perspective, we could easily reach ASIL\,B and even ASIL\,C for higher $\lambda_\mathrm{DRAM}$ rates. Since for any ASIL classification both the relative and the absolute metrics must be fulfilled, we can observe that, independent of the DRAM's failure rate $\lambda_\mathrm{DRAM}$, we cannot achieve a level higher than ASIL\,A. Thus, it does not help to improve the failure rates of the DRAM technology itself. \new{Our results for LPDDR5 are similar to the LPDDR4 results of the original conference paper. Although LPDDR5 introduces link ECC as an additional safety measure, high ASIL levels cannot be reached. This clearly shows the necessity of introducing more robust and holistic safety measures within the DRAM and the memory controller as well as on the software level.}

This confirms the results presented in~\cite{stekra_21} and \cite{buc_20} for single scenarios. They also conclude that with the current ECC safety measures, no safety level higher than ASIL\,A can be achieved. Since it is not likely that future DRAM technologies will lead to a decrease in failure rates, it is highly important to introduce further safety measures to make the DRAM system ready for higher ASIL levels.

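The ceiling can also be read off the threshold logic of the ASIL block from Listing~\ref{listing:asil}, reduced here to the relative metrics (a simplified sketch that assumes the absolute FIT targets are met): an SPFM of 81\,\% caps the classification at ASIL\,A regardless of the LFM.

```cpp
#include <cassert>
#include <string>

// Simplified sketch of the ASIL block's threshold logic, reduced to the
// relative metrics SPFM and LFM (absolute FIT targets assumed to be met).
std::string asil_from_metrics(double spfm, double lfm) {
    std::string level = "ASIL A";
    if (spfm > 90.0 && lfm > 60.0) level = "ASIL B";
    if (spfm > 97.0 && lfm > 80.0) level = "ASIL C";
    if (spfm > 99.0 && lfm > 90.0) level = "ASIL D";
    return level;
}
```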
\subsection{Performance Analysis}
%
\new{Additional safety measures usually come at the cost of performance and storage capacity.
The storage overhead, as shown in Figure~\ref{fig:in-line}, is \qty{12.5}{\percent}.
The effects on performance cannot simply be calculated analytically, so simulations must be carried out.
%To estimate the performance impact, simulations are required.
We simulated the performance, i.e., bandwidth and latency, of the discussed LPDDR5 memory subsystem using a best-case and a worst-case benchmark.
In DRAM systems, the best case is usually estimated with a sequential access pattern, where addresses are increased incrementally.
For the worst case, a random access pattern is used, since each memory access results in a row miss, which lowers the bandwidth and increases the latency.}

\new{Figure~\ref{fig:bandwdith} shows the theoretical maximum bandwidth of a single LPDDR5 channel, which is \qty{102.4}{\giga\bit\per\second}.
With the sequential access pattern, we reach a real bandwidth utilization of \qty{100.45}{\giga\bit\per\second} when the ECC functionality is disabled.
\qty{2}{\percent} of the maximum bandwidth is lost due to refreshes.
When ECC is enabled, the bandwidth drops to \qty{96.84}{\giga\bit\per\second}, which corresponds to a decrease of another \qty{3.5}{\percent}.
The drop is small because with a sequential pattern all columns within a row are accessed successively and the fetched parity bits can be fully utilized, i.e., only 4 additional ECC accesses are required for 56 user data accesses (see Figure~\ref{fig:in-line}).
When the DRAM is stressed with a worst-case scenario, i.e., a fully random access pattern where each data access results in a row miss, the real bandwidth utilization without ECC is \qty{47.28}{\giga\bit\per\second}, which is only \qty{46}{\percent} of the theoretical maximum bandwidth.
With ECC enabled, the bandwidth drops by another \qty{14}{\percent} to \qty{33.51}{\giga\bit\per\second}.
In this case the drop is greater because each user data access requires one additional ECC access.
This ECC access is at least a row hit.
When the bandwidth drop is set in direct relation to the real bandwidth utilization, it corresponds to a decrease of \qty{29}{\percent}, i.e., for random traffic the DRAM channel loses almost one third of its performance due to the additional safety measure.
}
%This is due to the high row miss rate and the additional ECC memory accesses.

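The percentages discussed above can be cross-checked from the simulated bandwidth values of Figure~\ref{fig:bandwdith} (plain C++; the helper names are ours):

```cpp
#include <cassert>
#include <cmath>

// Simulated bandwidth values in Gbit/s (from the bandwidth comparison).
constexpr double max_bw     = 102.40; // theoretical channel maximum
constexpr double seq_no_ecc = 100.45;
constexpr double seq_ecc    =  96.84;
constexpr double rnd_no_ecc =  47.28;
constexpr double rnd_ecc    =  33.51;

// Drop expressed as a fraction of the theoretical maximum bandwidth.
double drop_of_max(double without_ecc, double with_ecc) {
    return (without_ecc - with_ecc) / max_bw;
}

// Drop expressed relative to the achieved bandwidth without ECC.
double drop_relative(double without_ecc, double with_ecc) {
    return (without_ecc - with_ecc) / without_ecc;
}
```

This yields a drop of about \qty{3.5}{\percent} (sequential) and \qty{14}{\percent} (random) of the maximum bandwidth, and about \qty{29}{\percent} relative to the achieved random bandwidth without ECC.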
\begin{figure}[t!]
\centering
\begin{tikzpicture}
\begin{axis}[
ybar=1pt,
bar width = 20pt,
ymin=0,
ymajorgrids,
yminorgrids,
ylabel={Avg. Bandwidth [Gbit/s]},
symbolic x coords = {Sequential, Random, MAX},
xtick=data,
enlarge x limits=0.25,
legend style={at={(0.5,0.95)}, anchor=north,legend columns=1},
]
\addplot
coordinates {(Sequential, 100.45)(Random, 47.28)(MAX, 102.40)};
\addplot
coordinates {(Sequential, 96.84)(Random, 33.51)(MAX, 102.40)};
\legend{\small Without ECC, \small With ECC}
\end{axis}
\end{tikzpicture}
\caption{\new{Bandwidth Comparison}}
\label{fig:bandwdith}
\end{figure}
%

% \new{Furthermore, we analyzed the impact of the ECC on latency. The Figures~\ref{fig:linear-wo-ecc}, \ref{fig:linear-w-ecc}, \ref{fig:rand-wo-ecc} and \ref{fig:rand-w-ecc} show the latency histograms for the four investigated scenarios. It can be observed that the latency is only weakly affected in the sequential case, whereas in the random case, the distribution is shifted more towards higher latencies once ECC is enabled. The average latency for the sequential access pattern is \qty{162.4}{\nano\second} without ECC and \qty{168.2}{\nano\second} when ECC is enabled (\qty{3.4}{\percent} increase), whereas for the random case the average latency is \qty{344.3}{\nano\second} without ECC and increases to \qty{487.5}{\nano\second} with ECC (\qty{29.4}{\percent} increase).}

\new{Furthermore, we analyzed the impact of the in-line ECC on latency.
To do this, we varied the frequency at which requests are issued to the DRAM subsystem, starting from \qty{25}{\mega\hertz} and increasing in steps of \qty{25}{\mega\hertz} up to \qty{400}{\mega\hertz}, which is the theoretical maximum that a channel with a data rate of \qty{6400}{\mega\transfer\per\second} and a burst length of 16 can handle.
Figures~\ref{fig:lat_bw:linear} and \ref{fig:lat_bw:random} plot the average response latency of all requests over the bandwidth for the four investigated scenarios.
In the sequential case, the idle response latency is \qty{30}{\nano\second} with ECC disabled and increases only marginally (by less than \qty{0.5}{\nano\second}) when ECC is enabled.
At high request issue frequencies, the impact of the ECC becomes more visible, as the graph starts to saturate slightly earlier and the maximum response latency is higher (\qty{149}{\nano\second} compared to \qty{90}{\nano\second}).
In the random case, the idle response latency without ECC is already \qty{49}{\nano\second} because the target row must always be activated first.
When ECC is enabled, it increases by \qty{10}{\percent} to \qty{54}{\nano\second} because an additional ECC access is issued before each user data access.
Also, the impact at high request issue frequencies is more significant compared to the sequential case.
With ECC, the graph starts saturating at around \qty{150}{\mega\hertz} compared to \qty{200}{\mega\hertz} without ECC, and the maximum response latency increases from \qty{150}{\nano\second} to \qty{336}{\nano\second}.
This means that the channel with in-line ECC can handle around \qty{25}{\percent} less random traffic, which is consistent with the bandwidth results in Figure~\ref{fig:bandwdith}.
% at high freq. saturation starts earlier (150 vs. 200 MHz), higher max response latency
%Generally, it can be observed that the latency is only weakly affected in the sequential case, whereas in the random case the distribution is shifted more towards higher latencies once ECC is enabled.
%With the maximum frequency, the average latency for the sequential access pattern is \qty{158.7}{\nano\second} at \qty{99.6}{\giga\bit\per\second} without ECC and \qty{165.6}{\nano\second} at \qty{95.9}{\giga\bit\per\second} when ECC is enabled (\qty{4.3}{\percent} increase in latency), whereas for the random case the average latency is \qty{336.7}{\nano\second} at \qty{47.1}{\giga\bit\per\second} without ECC and increases to \qty{476.1}{\nano\second} at \qty{33.5}{\giga\bit\per\second} with ECC (\qty{41.4}{\percent} increase in latency).
}

\new{For establishing safety, this is a reasonable performance overhead. However, since the current safety measures are not sufficient to support levels higher than ASIL\,A, it will be necessary to add further safety measures or other coding techniques, such as a \textit{Cyclic Redundancy Check} (CRC), in the future.}

% \include{ecc_results}

\begin{figure*}%[h!]
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Latency [ns]}},
xlabel={\textbf{Bandwidth [Gbit/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
%height = 0.9\columnwidth,
xmin = 0,
ymin = 0,
xmax = 120,
ymax = 500,
legend style={legend pos=north west, font=\small}
%xmax = 10000
]

% Without ECC
\addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
(6.40 , 31.02) % 25 MHz
(12.80 , 29.45)
(19.19 , 29.27)
(25.58 , 28.69)
(31.97 , 29.50)
(38.28 , 29.51)
(44.73 , 29.38)
(51.10 , 29.17)
(56.98 , 29.75)
(63.77 , 30.17)
(70.21 , 30.61)
(76.56 , 31.80)
(82.91 , 33.01)
(89.24 , 35.59)
(95.03 , 42.13)
(99.49 , 90.34) % 400 MHz
% (99.49 , 147.20)
% (99.54 , 153.18)
% (99.59 , 158.73)
};

% With ECC
\addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
(6.40 , 31.22) % 25 MHz
(12.80 , 29.66)
(19.19 , 29.51)
(25.58 , 28.90)
(31.97 , 29.77)
(38.28 , 29.88)
(44.73 , 29.94)
(51.10 , 30.17)
(56.98 , 31.21)
(63.77 , 32.06)
(70.21 , 33.28)
(76.56 , 35.59)
(82.86 , 39.22)
(88.81 , 42.74)
(91.90 , 60.39)
(95.86 , 149.22) % 400 MHz
% (95.86 , 160.59)
% (95.86 , 163.21)
% (95.86 , 165.62)
};

\addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
(102.4, 0)
(102.4, 500)
} node[below, pos=0.5, rotate=90, font=\small] {Maximum (102.4\,Gb/s)};

\legend{
Without ECC,
With ECC,
}
\end{axis}
\end{tikzpicture}
\caption{Sequential}
\label{fig:lat_bw:linear}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\begin{tikzpicture}
\begin{axis}[
ylabel={\textbf{Latency [ns]}},
xlabel={\textbf{Bandwidth [Gbit/s]}},
grid=minor,
width = \textwidth,
height = 5.25cm,
%height = 0.9\columnwidth,
xmin = 0,
ymin = 0,
xmax = 120,
ymax = 500
%xmax = 10000
]

% Without ECC
\addplot[BrickRed, thick, mark=square, line cap=round, smooth] coordinates {
(6.40 , 48.87) % 25 MHz
(12.79 , 53.61)
(19.18 , 59.94)
(25.51 , 68.33)
(31.72 , 80.53)
(37.87 , 98.56)
(43.84 , 135.89)
(46.49 , 305.35) % 200 MHz
(46.83 , 322.82)
(46.80 , 329.26)
% (46.84 , 331.94)
% (46.87 , 333.30)
% (46.90 , 333.93)
% (46.90 , 334.75)
% (46.99 , 335.81)
(46.90 , 335.50) % 400 MHz
% (46.90 , 336.52)
% (47.05 , 336.72)
};

% With ECC
\addplot[MidnightBlue, thick, mark=square, line cap=round, smooth] coordinates {
(6.40 , 53.88) % 25 MHz
(12.79 , 58.95)
(19.17 , 66.62)
(25.50 , 76.68)
(31.68 , 109.71)
(33.40 , 440.06) % 150 MHz
(33.22 , 460.02)
(33.18 , 469.10)
% (33.49 , 468.44)
% (33.47 , 470.61)
% (33.47 , 472.10)
% (33.47 , 473.14)
% (33.47 , 473.94)
% (33.47 , 474.58)
% (33.47 , 475.09)
(33.47 , 475.05) % 400 MHz
% (33.47 , 475.84)
% (33.47 , 476.14)
};

\addplot[Black, thick, line cap=round, smooth, dashed] coordinates {
(102.4, 0)
(102.4, 500)
} node[below, pos=0.5, rotate=90, font=\small] {Maximum (102.4\,Gb/s)};

\end{axis}
\end{tikzpicture}
\caption{Random}
\label{fig:lat_bw:random}
\end{subfigure}
%%%%
\caption{Average Response Latency over Bandwidth for Sequential and Random Access Patterns}
\label{fig:lat_bw}
\end{figure*}

%\section{Discussion}
%\begin{description}
% \item Safety mechanisms and failure propagation due to structural division are easy to model with the dedicated \textit{Coverage} block and \textit{Split} block and don't have to be expressed extensively by logical expressions.
% \item Compared to the minimal cut set analysis, events related to coverage and splitting of failures are not included in the results and therefore there is no need for post-correction.
% \item No need for transformation between failure rates and probabilities.
% \item Automated calculation of different scenarios due to failure rate interface
% \item Extension of fault injection tests by results of hardware architectural metrics
% \item Integration within functional System C blocks.
% \item Disadvantage: wiring must be done carefully, but System C compiler supports with connection check
% \item Analytical calculation instead of simulation -> faster and comprehensible, but diagnostic coverage and split distribution must be known
% \item Divide and conquer: complex failure propagation path handled by analysis component by component
% \item \textit{Coverage} blocks could be placed in chain, so that LFM is supported (already covered single-point faults of preceding {Coverage} block could be considered again in subsequent {Coverage} block with focus on latent faults
%\end{description}

\section{Conclusion and Future Work}
\label{sec:conclusion}
In this paper, we presented a new methodology for modeling the safety behavior of modern hardware systems in compliance with the ISO\,26262 automotive standard. The implementation of this new methodology is provided as an open-source SystemC library and can be used to enhance legacy models with safety and quality analysis. In order to demonstrate the power of this new methodology, we modeled a state-of-the-art automotive DRAM memory architecture. Based on this model, we simulated a continuous space of failure rates of the DRAM system. We conclude that with the current safety measures, it is not possible to achieve a rating higher than ASIL\,A. \new{Furthermore, we combined the safety simulation with a functional simulation, such that the overhead of the safety measures could be estimated quickly. We observe a storage overhead of \qty{12.5}{\percent} and a bandwidth overhead of \qty{3.5}{\percent} in the best case and \qty{14}{\percent} in the worst case. In the future, we will use the presented methodology to analyze new safety measures that could help reach the goal of an ASIL\,D certification.}
%
\section*{Authors' Contributions}
All authors contributed to all parts of the paper.
%
\section*{Funding}
This work was partly funded by the German Federal Ministry of Education and Research (BMBF) under grant 16ME0717 (MANNHEIM-MEMTONOMY) and supported by the Fraunhofer High Performance Center for Simulation- and Software-Based Innovation.
%
\section*{Conflict of Interest}
There is no conflict of interest.
%
\section*{Acknowledgements}
There are no acknowledgements at this time.
%
\bibliography{references_JR.bib}% common bib file
%
\end{document}