From 72ab0f1801dfa8f65bdad9ff4ef47465077fe8ef Mon Sep 17 00:00:00 2001 From: Lukas Steiner Date: Mon, 8 Jul 2024 07:07:46 +0000 Subject: [PATCH] Update on Overleaf. --- letter_reviewers.tex | 232 +++++++++++++++++++++++++++++++++++++++ main.tex | 252 +++++++++++++++++++++---------------------- 2 files changed, 358 insertions(+), 126 deletions(-) create mode 100644 letter_reviewers.tex diff --git a/letter_reviewers.tex b/letter_reviewers.tex new file mode 100644 index 0000000..d8f41a5 --- /dev/null +++ b/letter_reviewers.tex @@ -0,0 +1,232 @@ +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%% %% +%% Please do not use \input{...} to include other tex files. %% +%% Submit your LaTeX manuscript as one .tex document. %% +%% %% +%% All additional figures and files should be attached %% +%% separately and not embedded in the \TeX\ document itself. %% +%% %% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +% see https://www.springer.com/journal/10766/submission-guidelines#Instructions%20for%20Authors_Title%20Page for submission guidelines + +%%\documentclass[referee,sn-basic]{sn-jnl}% referee option is meant for double line spacing + +%%=======================================================%% +%% to print line numbers in the margin use lineno option %% +%%=======================================================%% + +%%\documentclass[lineno,sn-basic]{sn-jnl}% Basic Springer Nature Reference Style/Chemistry Reference Style + +%%======================================================%% +%% to compile with pdflatex/xelatex use pdflatex option %% +%%======================================================%% + +%%\documentclass[pdflatex,sn-basic]{sn-jnl}% Basic Springer Nature Reference Style/Chemistry Reference Style + +% necessary hack to load tikz because Springer Nature uses the "program" package which results in errors +% see https://tex.stackexchange.com/a/615043 +\RequirePackage[dvipsnames]{xcolor} +\RequirePackage{tikz} + 
+%%\documentclass[sn-basic]{sn-jnl}% Basic Springer Nature Reference Style/Chemistry Reference Style +\documentclass[sn-mathphys]{sn-jnl}% Math and Physical Sciences Reference Style +%%\documentclass[sn-aps]{sn-jnl}% American Physical Society (APS) Reference Style +%%\documentclass[sn-vancouver]{sn-jnl}% Vancouver Reference Style +%%\documentclass[sn-apa]{sn-jnl}% APA Reference Style +%%\documentclass[sn-chicago]{sn-jnl}% Chicago-based Humanities Reference Style +%%\documentclass[sn-standardnature]{sn-jnl}% Standard Nature Portfolio Reference Style +%%\documentclass[default]{sn-jnl}% Default +%%\documentclass[default,iicol]{sn-jnl}% Default with double column layout + +%%%% Standard Packages +\usepackage[dvipsnames]{xcolor} +\newcommand\todo[1]{\textcolor{red}{#1}} +\newcommand\new[1]{\textcolor{Cyan}{#1}} +\newcommand\newer[1]{\textcolor{Green}{#1}} +\newcommand\reviewer[1]{ + \textcolor{gray}{\textit{#1}\vspace{0.25cm}} +} +\newcommand\answer[1]{ + #1\vspace{0.25cm} +} + +\usepackage{graphicx} +\usepackage{tabularray} +\usepackage{siunitx} +\DeclareSIUnit\transfer{T} +\sisetup{per-mode = symbol} + +\usepackage{amsmath} +\usepackage{ifthen} + +%\usepackage{tikz} +\usetikzlibrary{positioning} +\usetikzlibrary{backgrounds} +\usetikzlibrary{arrows.meta} +\usepackage{subcaption} + +\usepackage{minted} +\definecolor{LightGray}{gray}{0.9} + +\usepackage{pgfplots} +\pgfplotsset{compat=1.9} +\usepackage{circuitikz} +\usetikzlibrary{fit} +\usetikzlibrary{calc} +\input{blocks} + +\lstset{ + literate={~} {$\sim$}{1} +} + +%\usepackage[hidelinks]{hyperref} --> already loaded by the template +%%%% + +%%%%%=============================================================================%%%% +%%%% Remarks: This template is provided to aid authors with the preparation +%%%% of original research articles intended for submission to journals published +%%%% by Springer Nature. 
The guidance has been prepared in partnership with +%%%% production teams to conform to Springer Nature technical requirements. +%%%% Editorial and presentation requirements differ among journal portfolios and +%%%% research disciplines. You may find sections in this template are irrelevant +%%%% to your work and are empowered to omit any such section if allowed by the +%%%% journal you intend to submit to. The submission guidelines and policies +%%%% of the journal take precedence. A detailed User Manual is available in the +%%%% template package for technical guidance. +%%%%%=============================================================================%%%% + +\jyear{2022}% + +\raggedbottom +%%\unnumbered% uncomment this for unnumbered level heads + +\begin{document} +\section*{Letter to the Reviewers} +% +Dear Editor, + +Thank you for the valuable reviews of our journal paper. We have revised the paper according to the reviewers' recommendations. We also used the long reviewing time to further improve its quality and refined some calculations following discussions with a DRAM vendor. +The content added for the journal version is marked in \new{cyan}, and the content added or updated in this first revision is marked in \newer{green}. +% +\subsection*{Reviewer 1} +% +\reviewer{The authors have already presented the System-C-based methodology called Split'n'Cover for hardware safety analysis in a previous publication. This paper extends their work by analyzing a hardware system for automotive applications using LPDDR5 memories. A safety and performance analysis, taking into account the ISO 26262 norm and the new features provided by the LPDDR5, are part of the new content. The results show that the bandwidth and storage overhead derived from the new error correction techniques introduced by the LPDDR5 memories are up to 14\% and 12\%, respectively. 
In comparison to the previous publication, more than 30\% of the content of the current paper is novel.} + +\reviewer{This paper is well-written and based on previous publications. Sections (0) Introduction, (1) Background, (2) Related Work, (3) Methodology, and (4) Implementation are almost the same. No extensions are required so that, initially, the proposed methodology does not change. As in the previous paper, it is easy to understand the proposed methodology and its implementation.} + +\reviewer{Section (5) Case Study is new, introducing the new features implemented on the LPDDR5 memory. The authors emphasize the significance of the Link Error Correction Code (Link ECC) in minimizing transmission errors caused by high data rates. A safety model and a performance model are introduced.} + +\reviewer{Section (6) presents the safety and performance analysis. The results are exactly the same as those presented in the previous publication for LPDDR4. As mentioned by the authors, LPDDR5 introduces the Link ECC; Therefore, a more exhaustive explanation of the reason of non-improvement is desired. Please extend (if possible) this part of the paper.} + +\answer{We thank the reviewer for their suggestion of a more detailed explanation of the differences between the earlier LPDDR4 analysis and the extended LPDDR5 analysis. We agree that the minor differences in the results following from introducing the additional Link-ECC should be explained in more detail and have updated Section (7) Experimental Results to reflect this.} + +\reviewer{Section "1" (Introduction) is missing after the abstract.} + +\answer{We thank the reviewer for pointing out that the section title for the introduction was missing. This has been corrected accordingly.} + +\subsection*{Reviewer 2} + +\reviewer{% +- The paper is well-written and easy to follow. However, there is a less uniform text between the old and new text. 
+} + +\answer{We have revised the paper, in particular the abstract and introduction, and better harmonized the old and new text.} + +\reviewer{% +- The proposed approach is simple but sound and actually well suited for a composable method, aside from the main modeling of a complex platform. +} + +\reviewer{% +- The added sentence in the abstract in blue is misleading and does not provide to the reader what is expected by the authors. Instead of forcing the example of LPDRR there should be an added sentence about why the advent of consumer hardware is a major challenge. +} + +\reviewer{% +- Similar is for the introduction where there is exactly the same sentence. +} + +\answer{We thank the reviewer for pointing out that the added sentence regarding the emergence of consumer hardware in autonomous systems might miss the message we wanted to convey by focusing on the example of LPDDR. The sentence should instead address the new challenges posed by the use of consumer hardware in terms of security considerations. We have refined this part of the abstract and introduction to better convey the intended message.} + +\reviewer{% +- The added contribution of this new version of the paper is limited. The main methodology is exactly the same as presented in the SAMOS paper, while we have only the LPDDR5 use case instead of LPDDR4. The added performance analysis has nothing to do with the core of the proposed approach, or at least this is my feeling from reading the paper. I think that this is the main issue that the paper has in its current form. +} + +\reviewer{% +- The author should probably consider describing a larger proposal to evaluate the impact of possible safety measures since the beginning of the paper. +} + +\answer{We agree that the main methodology of the SystemC-based and ISO 26262-compliant safety analysis first presented in the SAMOS paper is largely unchanged and thank the reviewer for highlighting this. 
The additional contribution focuses on the new considerations regarding LPDDR5, such as the new Link ECC mechanism, which was added because of the increased interface failure rate, and the use of an inline ECC instead of the previous side-band ECC. +As correctly noted, the new performance analysis is essentially orthogonal to the safety analysis. +However, it examines the bandwidth and latency impact of the very same inline ECC mechanism considered in the safety analysis. +We agree that the introduction should state the paper's intent to analyze a larger proposal for safety analysis than the earlier SAMOS paper, and have revised it accordingly. The focus of the paper should now be clearer to the reader from the outset.} + +\subsection*{Reviewer 3} +\reviewer{% +This article describes an approach to computing hardware failure rates using SystemC. For this purpose, the authors implemented specific calculation blocks in SystemC. The authors argue that such an integrative approach is superior to established analysis techniques such as FTA and FMEDA. +In general, the approach seems appealing at first glance, as the constructive inclusion of safety aspects in designs has many advantages over a posteriori analyses. And overall, the approach seems worthy of further development. However, as far as the concrete article is concerned, there are some major flaws from a safety perspective that should be revised: +} + +\reviewer{% +It starts with the related work section: the authors refer to FMEA. For quantitative analysis, which is the goal of their approach, the correct approach would be FMEDA. For FTA, the authors refer directly to component fault tree analysis. First of all, CFT was not introduced by Adler et al. but by Kaiser et al.: Kaiser, B., Liggesmeyer, P. and Mäckel, O., 2003, October. A New Component Concept for Fault Trees. In Proceedings of the 8th Australian workshop on Safety Critical Systems and Software-Volume 33 (pp. 37-46). 
Moreover, the aspect of \_component\_ fault trees is not the relevant aspect with which to compare their approach, but the general approach of FTA. Consequently, using the analysis concepts introduced by Adler et al. is the wrong benchmark. For example, it is not necessary to compute MCS as long as no qualitative analysis is required, but modern approaches provide very efficient computational engines for quantitative calculations based on BDDs. Possibly they refer to the integration of safety models to design models, but there is also other approaches following such an approach and this does not seem to be key aspect here. Furthermore, it is unclear why an FTA would not be appropriate for considering the introduction of new safety measures - this is what the FTA has been used for for decades. Furthermore, the authors ignore other approaches such as Hip-Hops, AltaRica, Markov models, etc. Their approach is nonetheless novel, but a sound related work analysis seems appropriate for an archival publication. Therefore, it is recommended that the authors provide a more accurate description of the state of the art and a clearer distinction from existing work. +} + +\answer{We thank the reviewer for pointing out that the concept of component fault trees was in fact introduced by Kaiser et al. and we adjusted the reference accordingly. +We have revised the related work section to more accurately describe the state of the art and the aspects relevant to the approach. +} + +\reviewer{% +Regarding the methodology, it is important to note that the key idea of ISO 26262 follows a different direction - a top down guidance for developing safe hardware. The ASILs are derived based on risks. Depending on the ASIL, the standard requires specific measures and mechanisms to be applied constructively in order to sufficiently reduce the residual probability of failure. 
The metrics were introduced quite late in the standardization process to verify the sufficiency of the applied mechanisms, but the mere compliance of metrics can't replace following the prescribed development process. For example, for the reuse of existing software, ISO 8926 is currently being developed as a dedicated PAS, as measuring metrics is often not considered sufficient evidence. This does not mean that the authors' approach cannot work, but they should proactively address this aspect and show that they understand the basic idea of ISO 26262. +} + +\answer{We agree that our proposed methodology cannot replace the development process described in ISO 26262 and thank the reviewer for pointing this out. Rather, our approach is intended to support hardware developers during the design process by eliminating the need for additional translation steps to calculate the ISO-required metrics and by making the impact of introduced safety mechanisms easier to understand. Note that this is important for a hardware developer (e.g., Tier 1/2) because it facilitates a bottom-up integration process in which promises (e.g., on safety and performance) can be provided to system integrators. There are certainly other aspects that go into determining the ASIL of a hardware component with confidence. However, our approach does not claim to be a comprehensive solution in this respect. +} + +\reviewer{% +More critical, however, are some flaws in the math. There's a good reason why fault trees use probabilities instead of failure rates. In the case of Weibull distributions, a constant rate is only given for a certain period of time. For safety, however, we are interested in the worst case, which could be at the beginning or at the end, where a constant rate does not work. Also, one must be very careful not to confuse rates and probabilities. For example, the calculation of lamda\_RF is wrong. 
According to the description, c is some kind of diagnostic coverage, which is usually a constant probability, not an exponential distribution, i.e., not a rate. Mixing rates and probabilities leads to incorrect results. In this case, the error is on the conservative, i.e. safe side, because multiplying a constant probability by a rate means that the probability grows along the exponential distribution, leading to too high a failure probability. But it leaves the impression that the authors just got lucky. At the very least, they should explicitly state that they are aware of the problem and that they deliberately use a conservative approximation. In fact, diagnostic coverage is only the theoretical maximum that fault detection can achieve. Error detection can also fail due to random or systematic faults (which does not seem to be considered in their case study either). Therefore, the correct model would include an and-gate in a fault tree that models the error AND that the failure detection fails (with a constant probability modeling the DC OR due to a systematic/random exponentially distributed failure probability). Mathematically, however, the result of an AND gate does not have a constant failure rate. Passing this value on to the next calculation block assuming a constant failure rate can easily lead to calculation errors. As well as a wrong model that only considers an approximation of a constant probability as a factor multiplied by a rate. The same is true for the split block - at least to they mix constant probabilities with failure rates in their case study. +It seems recommendable that the authors either rethink their approach of a rate-based calculation (which can easily get tricky) and use probabilities instead. Or, which would probably be the less cumbersome way, to explain (and prove) in more detail why they think they are right, or at least have a conservative approximation. 
+} + +\answer{% +We thank the reviewer for their thorough analysis of the mathematical soundness of our proposed model. +We fully agree that, from a strict mathematical standpoint, the constant failure rates are only an approximation and that mixing constant probabilities with those rates does not, in turn, result in a constant rate. +However, our approach is oriented towards the metrics and analyses defined in ISO 26262. +We not only assume a constant failure rate, as stated in Section \ref{sec:background} with reference to the constant region of the bathtub curve, but also adopt the calculation principles of the standard. +For example, in our approach we calculate the residual failure rate of a coverage block by multiplying a probability by the input failure rate. +This is in line with Formula C.3 in ISO 26262-5: +$$\lambda_{RF} \leq \lambda_{RF,est} = \lambda \cdot \left(1-\frac{K_{DC,RF}}{100\%}\right)$$ +Similarly, the latent multi-point failure rate of our coverage block is calculated in accordance with Formula C.5: +$$\lambda_{MPF,L} \leq \lambda_{MPF,L,est} = \lambda \cdot \left(1-\frac{K_{DC,MPF,L}}{100\%}\right)$$ +Consequently, our approximations are in line with the simplifications made by the standard, which itself refers to these formulas as conservative approximations. +To better convey these approximations to the reader, we now describe this aspect in more detail in Section \ref{sec:background}. + +Further, we fully agree with the reviewer that the error correction and detection capabilities of coverage mechanisms only denote the theoretical maximum, since the mechanisms themselves are hardware components that can fail. +In our approach, we model this circumstance by introducing additional basic events that contribute to the total latent multi-point fault metric, as these are faults that only become visible in combination with another independent fault. 
+} + +\reviewer{% +On a more minor note, it would be interesting to see how the approach handles common causes such as heat, EMR, etc. that affect multiple components at once, so that the individual failure rates are no longer independent, which would again lead to incorrect results. +Also, the authors only refer to previous work to determine the failure rates of the basic events. However, we know that simply using different manuals to determine the failure rates of hardware parts can easily lead to differences of two orders of magnitude in the top event. Therefore, it would be interesting to see a sensitivity analysis regarding the robustness of their approach to input variances. Especially since the authors use very precise thresholds in their experimental results, e.g. they assign a budget of 53 FIT, i.e. they talk about 53E-9/h without considering confidence intervals. +In terms of evaluation, it would be good to see a comparison of their approach with a traditional safety analysis to prove its correctness. +} + +\answer{We thank the reviewer for their suggestion to further analyze the impact of common fault causes, such as heat, which affect multiple components at once. +Indeed, such common causes would result in the failure rates no longer being independent and would require a more thorough analysis. +However, we concentrate our analysis on a safety element out of context (SEooC): +The integration of the memory system into the complete vehicle would go beyond the scope of this paper. +Further, we agree that a more extensive sensitivity analysis regarding input variances would be a worthwhile effort that could be the subject of future work. +Regarding a comparison with traditional safety analysis, Steiner et al. \cite{stekra_21} analyze the corresponding LPDDR4 system with a traditional FTA approach and reach a very similar result. 
+% \begin{itemize} + % \item Agree that such analyses would be interesting (failure rates of different basic events with the same cause) + % \item Here: system out of context, no complete analysis of the car -> would go beyond the scope + % \item Comparison with traditional safety analysis -> maybe refer to the older paper "An LPDDR4 Safety Model for Automotive Applications"? +% \end{itemize} +} + +\reviewer{% +Overall, however, the approach as such is appealing and the aspects mentioned above seem to be solvable with a reasonable amount of effort and time. For safety, a certain rigor is required to pass a safety assessment, while the article leaves the impression of an inappropriate carelessness when it comes to safety calculations. Therefore, it seems highly recommendable that the authors treat the safety analysis and its math with the appropriate rigor and soundness. +} + +\answer{We would like to express our appreciation to the reviewer for recognizing the appeal of our novel approach and for the confidence in its potential. +We also agree with the observation that the approximations involved should be described more clearly in the text, and have therefore revised the corresponding passages to make clear that our approach relies extensively on the estimates made in ISO 26262.} + +\end{document} \ No newline at end of file diff --git a/main.tex b/main.tex index f89ab86..628d9a7 100644 --- a/main.tex +++ b/main.tex @@ -102,135 +102,135 @@ %%\unnumbered% uncomment this for unnumbered level heads \begin{document} -\section*{Letter to the Reviewers} +%\section*{Letter to the Reviewers} +%% +%Dear Editor, % -Dear Editor, - -thank you for the valuable reviews of our journal paper. We revised the paper according to the recommendations of the reviewers. We used the long reviewing time also to further improve the quality and also refined some calculations due to discussions that we had with a DRAM vendor. 
-The additional content of the journal is marked in \new{cyan}, and the additional/updated content of this first revision is marked in \newer{green}. +%thank you for the valuable reviews of our journal paper. We revised the paper according to the recommendations of the reviewers. We used the long reviewing time also to further improve the quality and also refined some calculations due to discussions that we had with a DRAM vendor. +%The additional content of the journal is marked in \new{cyan}, and the additional/updated content of this first revision is marked in \newer{green}. +%% +%\subsection*{Reviewer 1} +%% +%\reviewer{The authors have already presented the System-C-based methodology called Split'n'Cover for hardware safety analysis in a previous publication. This paper extends their work by analyzing a hardware system for automotive applications using LPDDR5 memories. A safety and performance analysis, taking into account the ISO 26262 norm and the new features provided by the LPDDR5, are part of the new content. The results show that the bandwidth and storage overhead derived from the new error correction techniques introduced by the LPDDR5 memories are up to 14\% and 12\%, respectively. In comparison to the previous publication, more than 30\% of the content of the current paper is novel.} % -\subsection*{Reviewer 1} +%\reviewer{This paper is well-written and based on previous publications. Sections (0) Introduction, (1) Background, (2) Related Work, (3) Methodology, and (4) Implementation are almost the same. No extensions are required so that, initially, the proposed methodology does not change. As in the previous paper, it is easy to understand the proposed methodology and its implementation.} +% +%\reviewer{Section (5) Case Study is new, introducing the new features implemented on the LPDDR5 memory. The authors emphasize the significance of the Link Error Correction Code (Link ECC) in minimizing transmission errors caused by high data rates. 
A safety model and a performance model are introduced.} +% +%\reviewer{Section (6) presents the safety and performance analysis. The results are exactly the same as those presented in the previous publication for LPDDR4. As mentioned by the authors, LPDDR5 introduces the Link ECC; Therefore, a more exhaustive explanation of the reason of non-improvement is desired. Please extend (if possible) this part of the paper.} +% +%\answer{We thank the reviewer for their suggestion of a more detailed explanation of the differences between the earlier LPDDR4 analysis and the extended LPDDR5 analysis. We agree that the minor differences in the results following from introducing the additional Link-ECC should be explained in more detail and have updated Section (7) Experimental Results to reflect this.} +% +%\reviewer{Section "1" (Introduction) is missing after the abstract.} +% +%\answer{We thank the reviewer for pointing out that the section title for the introduction was missing. This has been corrected accordingly.} +% +%\subsection*{Reviewer 2} +% +%\reviewer{% +%- The paper is well-written and easy to follow. However, there is a less uniform text between the old and new text. +%} +% +%\answer{We have revised the paper, in particular abstract and introduction, and better harmonized the old and new texts.} +% +%\reviewer{% +%- The proposed approach is simple but sound and actually well suited for a composable method, aside from the main modeling of a complex platform. +%} +% +%\reviewer{% +%- The added sentence in the abstract in blue is misleading and does not provide to the reader what is expected by the authors. Instead of forcing the example of LPDRR there should be an added sentence about why the advent of consumer hardware is a major challenge. +%} +% +%\reviewer{% +%- Similar is for the introduction where there is exactly the same sentence. 
+%} +% +%\answer{We thank the reviewer for pointing out that the added sentence regarding the emergence of consumer hardware in autonomous systems might miss the message we wanted to convey by focusing on the example of LPDDR. It should more accurately refer to the aspect of new challenges posed by the use of consumer hardware in terms of security considerations. We have refined this part of the abstract and introduction to better convey the intended message.} +% +%\reviewer{% +%- The added contribution of this new version of the paper is limited. The main methodology is exactly the same as presented in the SAMOS paper, while we have only the LPDDR5 use case instead of LPDDR4. The added performance analysis has nothing to do with the core of the proposed approach, or at least this is my feeling from reading the paper. I think that this is the main issue that the paper has in its current form. +%} +% +%\reviewer{% +%- The author should probably consider describing a larger proposal to evaluate the impact of possible safety measures since the beginning of the paper. +%} +% +%\answer{We agree with the statement that the main methodology of the SystemC-based and ISO26262-compliant safety analysis first presented in the SAMOS paper is largely the same and would like to thank the reviewer for highlighting this. The additional contribution focuses on the new considerations regarding LPDDR5, such as the new link ECC mechanism added due to the increased interface failure rate as well as the usage of an inline ECC instead of the previous side-band ECC. +%As correctly noted, the new performance analysis is essentially orthogonal to the safety analysis. +%However, it examines the bandwidth and latency impact of the very same inline ECC mechanism considered in the safety analysis. 
+%We agree that the introduction should include the intent of the paper to analyze a further and larger proposal for safety analysis compared to the earlier SAMOS paper, and have incorporated this accordingly. The focus of the paper should now be clearer to the reader from the beginning on.} +% +%\subsection*{Reviewer 3} +%\reviewer{% +%This article describes an approach to computing hardware failure rates using SystemC. For this purpose, the authors implemented specific calculation blocks in SystemC. The authors argue that such an integrative approach is superior to established analysis techniques such as FTA and FMEDA. +%In general, the approach seems appealing at first glance, as the constructive inclusion of safety aspects in designs has many advantages over a posteriori analyses. And overall, the approach seems worthy of further development. However, as far as the concrete article is concerned, there are some major flaws from a safety perspective that should be revised: +%} +% +%\reviewer{% +%It starts with the related work section: the authors refer to FMEA. For quantitative analysis, which is the goal of their approach, the correct approach would be FMEDA. For FTA, the authors refer directly to component fault tree analysis. First of all, CFT was not introduced by Adler et al. but by Kaiser et al.: Kaiser, B., Liggesmeyer, P. and Mäckel, O., 2003, October. A New Component Concept for Fault Trees. In Proceedings of the 8th Australian workshop on Safety Critical Systems and Software-Volume 33 (pp. 37-46). Moreover, the aspect of \_component\_ fault trees is not the relevant aspect with which to compare their approach, but the general approach of FTA. Consequently, using the analysis concepts introduced by Adler et al. is the wrong benchmark. For example, it is not necessary to compute MCS as long as no qualitative analysis is required, but modern approaches provide very efficient computational engines for quantitative calculations based on BDDs. 
Possibly they refer to the integration of safety models to design models, but there is also other approaches following such an approach and this does not seem to be key aspect here. Furthermore, it is unclear why an FTA would not be appropriate for considering the introduction of new safety measures - this is what the FTA has been used for for decades. Furthermore, the authors ignore other approaches such as Hip-Hops, AltaRica, Markov models, etc. Their approach is nonetheless novel, but a sound related work analysis seems appropriate for an archival publication. Therefore, it is recommended that the authors provide a more accurate description of the state of the art and a clearer distinction from existing work. +%} +% +%\answer{We thank the reviewer for pointing out that the concept of component fault trees was in fact introduced by Kaiser et al. and we adjusted the reference accordingly. +%We have revised the related work section to more accurately describe the state of the art and the aspects relevant to the approach. +%} +% +%\reviewer{% +%Regarding the methodology, it is important to note that the key idea of ISO 26262 follows a different direction - a top down guidance for developing safe hardware. The ASILs are derived based on risks. Depending on the ASIL, the standard requires specific measures and mechanisms to be applied constructively in order to sufficiently reduce the residual probability of failure. The metrics were introduced quite late in the standardization process to verify the sufficiency of the applied mechanisms, but the mere compliance of metrics can't replace following the prescribed development process. For example, for the reuse of existing software, ISO 8926 is currently being developed as a dedicated PAS, as measuring metrics is often not considered sufficient evidence. This does not mean that the authors' approach cannot work, but they should proactively address this aspect and show that they understand the basic idea of ISO 26262. 
+%} +% +%\answer{We agree that our proposed methodology cannot replace the development process described in ISO and thank the reviewer for pointing this out. Rather, our approach is intended to more specifically support hardware developers during the design process by eliminating the need for additional translation steps to calculate the ISO required metrics and by facilitating the understanding of the impact of introduced safety mechanisms. Note this is important for a hardware developer (e.g. Tier 1/2) to facilitate a bottom-up integration process where promises (e.g. safety, performance,..) can be provided to system integrators. There are certainly other aspects that go into determining the ASIL of a HW component with confidence. However, our approach does not claim to be a comprehensive solution in this respect. +%} +% +%\reviewer{% +%More critical, however, are some flaws in the math. There's a good reason why fault trees use probabilities instead of failure rates. In the case of Weibull distributions, a constant rate is only given for a certain period of time. For safety, however, we are interested in the worst case, which could be at the beginning or at the end, where a constant rate does not work. Also, one must be very careful not to confuse rates and probabilities. For example, the calculation of lamda\_RF is wrong. According to the description, c is some kind of diagnostic coverage, which is usually a constant probability, not an exponential distribution, i.e., not a rate. Mixing rates and probabilities leads to incorrect results. In this case, the error is on the conservative, i.e. safe side, because multiplying a constant probability by a rate means that the probability grows along the exponential distribution, leading to too high a failure probability. But it leaves the impression that the authors just got lucky. 
At the very least, they should explicitly state that they are aware of the problem and that they deliberately use a conservative approximation. In fact, diagnostic coverage is only the theoretical maximum that fault detection can achieve. Error detection can also fail due to random or systematic faults (which does not seem to be considered in their case study either). Therefore, the correct model would include an and-gate in a fault tree that models the error AND that the failure detection fails (with a constant probability modeling the DC OR due to a systematic/random exponentially distributed failure probability). Mathematically, however, the result of an AND gate does not have a constant failure rate. Passing this value on to the next calculation block assuming a constant failure rate can easily lead to calculation errors. As well as a wrong model that only considers an approximation of a constant probability as a factor multiplied by a rate. The same is true for the split block - at least to they mix constant probabilities with failure rates in their case study. +%It seems recommendable that the authors either rethink their approach of a rate-based calculation (which can easily get tricky) and use probabilities instead. Or, which would probably be the less cumbersome way, to explain (and prove) in more detail why they think they are right, or at least have a conservative approximation. +%} +% +%\answer{% +%We thank the reviewer for their thorough analysis of the mathematical soundness of our proposed model. +%We fully agree that, from a strict mathematical standpoint, the constant failure rates are only an approximation and that mixing constant probabilities with those rates does not, in turn, result in constant rates. +%However, our approach is oriented towards the metrics and analyses performed in ISO26262.
+%We not only assume a constant failure rate, as noted in Section \ref{sec:background} with reference to the constant region of the bathtub curve, but also leverage the calculation principles of the ISO. +%For example, in our approach we calculate the residual failure rate of a coverage block by multiplying the input failure rate by a probability. +%This is in line with the calculation given by Formula C.3 in ISO26262-5: +%$$\lambda_{RF} \leq \lambda_{RF,est} = \lambda \cdot \left(1-\frac{K_{DC,RF}}{100\%}\right)$$ +%Similarly, the latent multi-point failure rate of our coverage block is calculated in accordance with Formula C.5: +%$$\lambda_{MPF,L} \leq \lambda_{MPF,L,est} = \lambda \cdot \left(1-\frac{K_{DC,MPF,L}}{100\%}\right)$$ +%Consequently, our approximations are in line with the simplifications made by the ISO, which itself refers to these formulas as conservative approximations. +%To better convey these approximations to the reader, we have now described this aspect in more detail in Section \ref{sec:background}. +% +%Further, we fully agree with the reviewer that the error correction and detection capabilities of coverage mechanisms only denote the theoretical maximum, since the mechanisms themselves are hardware components that can fail. +%In our approach, we model this by introducing additional basic events that contribute to the total latent multi-point fault metric, as these are faults that become visible only in combination with another independent fault. +%} +% +%\reviewer{% +%On a more minor note, it would be interesting to see how the approach handles common causes such as heat, EMR, etc. that affect multiple components at once, so that the individual failure rates are no longer independent, which would again lead to incorrect results. +%Also, the authors only refer to previous work to determine the failure rates of the basic events.
However, we know that simply using different manuals to determine the failure rates of hardware parts can easily lead to differences of two orders of magnitude in the top event. Therefore, it would be interesting to see a sensitivity analysis regarding the robustness of their approach to input variances. Especially since the authors use very precise thresholds in their experimental results, e.g. they assign a budget of 53 FIT, i.e. they talk about 53E-9/h without considering confidence intervals. +%In terms of evaluation, it would be good to see a comparison of their approach with a traditional safety analysis to prove its correctness. +%} +% +%\answer{We thank the reviewer for their suggestion to further analyze the impact of common fault causes, such as heat, that affect multiple components at once. +%Indeed, such common causes would result in the failure rates no longer being independent and would require a more thorough analysis. +%However, we concentrate our analysis on a safety element out of context: +%The integration of the memory system into the complete vehicle would go beyond the scope of this paper. +%Further, we agree that a more extensive sensitivity analysis regarding input variances would be a worthwhile effort that could be the subject of future work. +%Regarding a comparison with traditional safety analysis, the reference Steiner et al. \cite{stekra_21} analyzes the corresponding LPDDR4 system with a traditional FTA approach, reaching a very similar result. +%% \begin{itemize} +% % \item Agree that such analyses would be interesting (failure rates of different basic events with a common cause) +% % \item here system-out-of-context, no complete analysis of the vehicle -> would go beyond the scope +% % \item Comparison with traditional safety analysis -> Maybe refer to the older paper "An LPDDR4 Safety Model for Automotive Applications"?
+%% \end{itemize} +%} +% +%\reviewer{% +%Overall, however, the approach as such is appealing and the aspects mentioned above seem to be solvable with a reasonable amount of effort and time. For safety, a certain rigor is required to pass a safety assessment, while the article leaves the impression of an inappropriate carelessness when it comes to safety calculations. Therefore, it seems highly recommendable that the authors treat the safety analysis and its math with the appropriate rigor and soundness. +%} +% +%\answer{We would like to express our appreciation to the reviewer for recognizing the appeal of our novel approach and for the confidence in its potential. +%We also agree with the observation that the approximations involved should be described more clearly in the text, and have therefore revised the text to clarify that our approach relies extensively on the estimates made in ISO26262.} +% +%\newpage % -\reviewer{The authors have already presented the System-C-based methodology called Split'n'Cover for hardware safety analysis in a previous publication. This paper extends their work by analyzing a hardware system for automotive applications using LPDDR5 memories. A safety and performance analysis, taking into account the ISO 26262 norm and the new features provided by the LPDDR5, are part of the new content. The results show that the bandwidth and storage overhead derived from the new error correction techniques introduced by the LPDDR5 memories are up to 14\% and 12\%, respectively. In comparison to the previous publication, more than 30\% of the content of the current paper is novel.} - -\reviewer{This paper is well-written and based on previous publications. Sections (0) Introduction, (1) Background, (2) Related Work, (3) Methodology, and (4) Implementation are almost the same. No extensions are required so that, initially, the proposed methodology does not change.
As in the previous paper, it is easy to understand the proposed methodology and its implementation.} - -\reviewer{Section (5) Case Study is new, introducing the new features implemented on the LPDDR5 memory. The authors emphasize the significance of the Link Error Correction Code (Link ECC) in minimizing transmission errors caused by high data rates. A safety model and a performance model are introduced.} - -\reviewer{Section (6) presents the safety and performance analysis. The results are exactly the same as those presented in the previous publication for LPDDR4. As mentioned by the authors, LPDDR5 introduces the Link ECC; Therefore, a more exhaustive explanation of the reason of non-improvement is desired. Please extend (if possible) this part of the paper.} - -\answer{We thank the reviewer for their suggestion of a more detailed explanation of the differences between the earlier LPDDR4 analysis and the extended LPDDR5 analysis. We agree that the minor differences in the results following from introducing the additional Link-ECC should be explained in more detail and have updated Section (7) Experimental Results to reflect this.} - -\reviewer{Section "1" (Introduction) is missing after the abstract.} - -\answer{We thank the reviewer for pointing out that the section title for the introduction was missing. This has been corrected accordingly.} - -\subsection*{Reviewer 2} - -\reviewer{% -- The paper is well-written and easy to follow. However, there is a less uniform text between the old and new text. -} - -\answer{We have revised the paper, in particular abstract and introduction, and better harmonized the old and new texts.} - -\reviewer{% -- The proposed approach is simple but sound and actually well suited for a composable method, aside from the main modeling of a complex platform. -} - -\reviewer{% -- The added sentence in the abstract in blue is misleading and does not provide to the reader what is expected by the authors. 
Instead of forcing the example of LPDRR there should be an added sentence about why the advent of consumer hardware is a major challenge. -} - -\reviewer{% -- Similar is for the introduction where there is exactly the same sentence. -} - -\answer{We thank the reviewer for pointing out that the added sentence regarding the emergence of consumer hardware in autonomous systems might miss the message we wanted to convey by focusing on the example of LPDDR. It should more accurately refer to the aspect of new challenges posed by the use of consumer hardware in terms of security considerations. We have refined this part of the abstract and introduction to better convey the intended message.} - -\reviewer{% -- The added contribution of this new version of the paper is limited. The main methodology is exactly the same as presented in the SAMOS paper, while we have only the LPDDR5 use case instead of LPDDR4. The added performance analysis has nothing to do with the core of the proposed approach, or at least this is my feeling from reading the paper. I think that this is the main issue that the paper has in its current form. -} - -\reviewer{% -- The author should probably consider describing a larger proposal to evaluate the impact of possible safety measures since the beginning of the paper. -} - -\answer{We agree with the statement that the main methodology of the SystemC-based and ISO26262-compliant safety analysis first presented in the SAMOS paper is largely the same and would like to thank the reviewer for highlighting this. The additional contribution focuses on the new considerations regarding LPDDR5, such as the new link ECC mechanism added due to the increased interface failure rate as well as the usage of an inline ECC instead of the previous side-band ECC. -As correctly noted, the new performance analysis is essentially orthogonal to the safety analysis. 
-However, it examines the bandwidth and latency impact of the very same inline ECC mechanism considered in the safety analysis. -We agree that the introduction should include the intent of the paper to analyze a further and larger proposal for safety analysis compared to the earlier SAMOS paper, and have incorporated this accordingly. The focus of the paper should now be clearer to the reader from the beginning on.} - -\subsection*{Reviewer 3} -\reviewer{% -This article describes an approach to computing hardware failure rates using SystemC. For this purpose, the authors implemented specific calculation blocks in SystemC. The authors argue that such an integrative approach is superior to established analysis techniques such as FTA and FMEDA. -In general, the approach seems appealing at first glance, as the constructive inclusion of safety aspects in designs has many advantages over a posteriori analyses. And overall, the approach seems worthy of further development. However, as far as the concrete article is concerned, there are some major flaws from a safety perspective that should be revised: -} - -\reviewer{% -It starts with the related work section: the authors refer to FMEA. For quantitative analysis, which is the goal of their approach, the correct approach would be FMEDA. For FTA, the authors refer directly to component fault tree analysis. First of all, CFT was not introduced by Adler et al. but by Kaiser et al.: Kaiser, B., Liggesmeyer, P. and Mäckel, O., 2003, October. A New Component Concept for Fault Trees. In Proceedings of the 8th Australian workshop on Safety Critical Systems and Software-Volume 33 (pp. 37-46). Moreover, the aspect of \_component\_ fault trees is not the relevant aspect with which to compare their approach, but the general approach of FTA. Consequently, using the analysis concepts introduced by Adler et al. is the wrong benchmark. 
For example, it is not necessary to compute MCS as long as no qualitative analysis is required, but modern approaches provide very efficient computational engines for quantitative calculations based on BDDs. Possibly they refer to the integration of safety models to design models, but there is also other approaches following such an approach and this does not seem to be key aspect here. Furthermore, it is unclear why an FTA would not be appropriate for considering the introduction of new safety measures - this is what the FTA has been used for for decades. Furthermore, the authors ignore other approaches such as Hip-Hops, AltaRica, Markov models, etc. Their approach is nonetheless novel, but a sound related work analysis seems appropriate for an archival publication. Therefore, it is recommended that the authors provide a more accurate description of the state of the art and a clearer distinction from existing work. -} - -\answer{We thank the reviewer for pointing out that the concept of component fault trees was in fact introduced by Kaiser et al. and we adjusted the reference accordingly. -We have revised the related work section to more accurately describe the state of the art and the aspects relevant to the approach. -} - -\reviewer{% -Regarding the methodology, it is important to note that the key idea of ISO 26262 follows a different direction - a top down guidance for developing safe hardware. The ASILs are derived based on risks. Depending on the ASIL, the standard requires specific measures and mechanisms to be applied constructively in order to sufficiently reduce the residual probability of failure. The metrics were introduced quite late in the standardization process to verify the sufficiency of the applied mechanisms, but the mere compliance of metrics can't replace following the prescribed development process. 
For example, for the reuse of existing software, ISO 8926 is currently being developed as a dedicated PAS, as measuring metrics is often not considered sufficient evidence. This does not mean that the authors' approach cannot work, but they should proactively address this aspect and show that they understand the basic idea of ISO 26262. -} - -\answer{We agree that our proposed methodology cannot replace the development process described in ISO and thank the reviewer for pointing this out. Rather, our approach is intended to more specifically support hardware developers during the design process by eliminating the need for additional translation steps to calculate the ISO required metrics and by facilitating the understanding of the impact of introduced safety mechanisms. Note this is important for a hardware developer (e.g. Tier 1/2) to facilitate a bottom-up integration process where promises (e.g. safety, performance,..) can be provided to system integrators. There are certainly other aspects that go into determining the ASIL of a HW component with confidence. However, our approach does not claim to be a comprehensive solution in this respect. -} - -\reviewer{% -More critical, however, are some flaws in the math. There's a good reason why fault trees use probabilities instead of failure rates. In the case of Weibull distributions, a constant rate is only given for a certain period of time. For safety, however, we are interested in the worst case, which could be at the beginning or at the end, where a constant rate does not work. Also, one must be very careful not to confuse rates and probabilities. For example, the calculation of lamda\_RF is wrong. According to the description, c is some kind of diagnostic coverage, which is usually a constant probability, not an exponential distribution, i.e., not a rate. Mixing rates and probabilities leads to incorrect results. In this case, the error is on the conservative, i.e. 
safe side, because multiplying a constant probability by a rate means that the probability grows along the exponential distribution, leading to too high a failure probability. But it leaves the impression that the authors just got lucky. At the very least, they should explicitly state that they are aware of the problem and that they deliberately use a conservative approximation. In fact, diagnostic coverage is only the theoretical maximum that fault detection can achieve. Error detection can also fail due to random or systematic faults (which does not seem to be considered in their case study either). Therefore, the correct model would include an and-gate in a fault tree that models the error AND that the failure detection fails (with a constant probability modeling the DC OR due to a systematic/random exponentially distributed failure probability). Mathematically, however, the result of an AND gate does not have a constant failure rate. Passing this value on to the next calculation block assuming a constant failure rate can easily lead to calculation errors. As well as a wrong model that only considers an approximation of a constant probability as a factor multiplied by a rate. The same is true for the split block - at least to they mix constant probabilities with failure rates in their case study. -It seems recommendable that the authors either rethink their approach of a rate-based calculation (which can easily get tricky) and use probabilities instead. Or, which would probably be the less cumbersome way, to explain (and prove) in more detail why they think they are right, or at least have a conservative approximation. -} - -\answer{% -We thank the reviewer for their thorough analysis of the mathematical soundness of our proposed model. -We fully agree that, from a strict mathematical standpoint, the constant failure rates are only an approximation and that mixing constant probabilities with those rates does not, in turn, result in constant rates.
-However, our approach is oriented towards the metrics and analyses performed in ISO26262. -We not only assume a constant failure rate, as noted in Section \ref{sec:background} with reference to the constant region of the bathtub curve, but also leverage the calculation principles of the ISO. -For example, in our approach we calculate the residual failure rate of a coverage block by multiplying the input failure rate by a probability. -This is in line with the calculation given by Formula C.3 in ISO26262-5: -$$\lambda_{RF} \leq \lambda_{RF,est} = \lambda \cdot \left(1-\frac{K_{DC,RF}}{100\%}\right)$$ -Similarly, the latent multi-point failure rate of our coverage block is calculated in accordance with Formula C.5: -$$\lambda_{MPF,L} \leq \lambda_{MPF,L,est} = \lambda \cdot \left(1-\frac{K_{DC,MPF,L}}{100\%}\right)$$ -Consequently, our approximations are in line with the simplifications made by the ISO, which itself refers to these formulas as conservative approximations. -To better convey these approximations to the reader, we have now described this aspect in more detail in Section \ref{sec:background}. - -Further, we fully agree with the reviewer that the error correction and detection capabilities of coverage mechanisms only denote the theoretical maximum, since the mechanisms themselves are hardware components that can fail. -In our approach, we model this by introducing additional basic events that contribute to the total latent multi-point fault metric, as these are faults that become visible only in combination with another independent fault. -} - -\reviewer{% -On a more minor note, it would be interesting to see how the approach handles common causes such as heat, EMR, etc. that affect multiple components at once, so that the individual failure rates are no longer independent, which would again lead to incorrect results. -Also, the authors only refer to previous work to determine the failure rates of the basic events.
However, we know that simply using different manuals to determine the failure rates of hardware parts can easily lead to differences of two orders of magnitude in the top event. Therefore, it would be interesting to see a sensitivity analysis regarding the robustness of their approach to input variances. Especially since the authors use very precise thresholds in their experimental results, e.g. they assign a budget of 53 FIT, i.e. they talk about 53E-9/h without considering confidence intervals. -In terms of evaluation, it would be good to see a comparison of their approach with a traditional safety analysis to prove its correctness. -} - -\answer{We thank the reviewer for their suggestion to further analyze the impact of common fault causes, such as heat, that affect multiple components at once. -Indeed, such common causes would result in the failure rates no longer being independent and would require a more thorough analysis. -However, we concentrate our analysis on a safety element out of context: -The integration of the memory system into the complete vehicle would go beyond the scope of this paper. -Further, we agree that a more extensive sensitivity analysis regarding input variances would be a worthwhile effort that could be the subject of future work. -Regarding a comparison with traditional safety analysis, the reference Steiner et al. \cite{stekra_21} analyzes the corresponding LPDDR4 system with a traditional FTA approach, reaching a very similar result. -% \begin{itemize} - % \item Agree that such analyses would be interesting (failure rates of different basic events with a common cause) - % \item here system-out-of-context, no complete analysis of the vehicle -> would go beyond the scope - % \item Comparison with traditional safety analysis -> Maybe refer to the older paper "An LPDDR4 Safety Model for Automotive Applications"?
-% \end{itemize} -} - -\reviewer{% -Overall, however, the approach as such is appealing and the aspects mentioned above seem to be solvable with a reasonable amount of effort and time. For safety, a certain rigor is required to pass a safety assessment, while the article leaves the impression of an inappropriate carelessness when it comes to safety calculations. Therefore, it seems highly recommendable that the authors treat the safety analysis and its math with the appropriate rigor and soundness. -} - -\answer{We would like to express our appreciation to the reviewer for recognizing the appeal of our novel approach and for the confidence in its potential. -We also agree with the observation that the approximations involved should be described more clearly in the text, and have therefore revised the text to clarify that our approach relies extensively on the estimates made in ISO26262.} - -\newpage - \title[Split'n'Cover: ISO\,26262 Hardware Safety Analysis with SystemC]{Split'n'Cover: ISO\,26262 Hardware Safety Analysis with SystemC} %%=============================================================%%