Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extension enables the
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. Said errors are Synchronous External
Abort (SEA), Asynchronous External Abort (signalled as SErrors), Fault Handling
and Error Recovery interrupts. The |EHF| document mentions various :ref:`error
handling use-cases <delegation-use-cases>`.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.
``RAS_TRAP_NS_ERR_REC_ACCESS`` controls access to the RAS error record
registers from the Non-secure world.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;

- When the probing identifies an error, a handler to handle it;

- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of consecutive records from that index;

- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of the
notifications listed above, the RAS framework can iterate through and probe the
error records for errors, and invoke the appropriate handler for each error
found.

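The probe-then-handle flow can be pictured with a small, self-contained model.
This is only a sketch and not the |TF-A| implementation: ``struct record``,
``dispatch_errors()`` and the sample probe and handler are hypothetical
stand-ins, and the real handler receives more context than shown here.

.. code:: c

    #include <stddef.h>

    /* Hypothetical, simplified stand-in for an enumerated error record. */
    struct record;
    typedef int (*probe_fn)(const struct record *rec, int *probe_data);
    typedef int (*handle_fn)(const struct record *rec, int probe_data);

    struct record {
        probe_fn probe;     /* returns non-zero when the record holds an error */
        handle_fn handler;  /* invoked only after probe reported an error */
        int pending_index;  /* illustrative state: negative means "no error" */
    };

    int last_handled = -1;  /* illustrative: index seen by the last handler */

    /* Illustrative probe: report the pending index, if there is one. */
    int sample_probe(const struct record *rec, int *probe_data)
    {
        if (rec->pending_index < 0)
            return 0;
        *probe_data = rec->pending_index;
        return 1;
    }

    /* Illustrative handler: remember which index was handled. */
    int sample_handler(const struct record *rec, int probe_data)
    {
        (void)rec;
        last_handled = probe_data;
        return 0;
    }

    /*
     * Probe every record in order and call the handler of each record that
     * reports an error. Returns the number of errors handled.
     */
    int dispatch_errors(const struct record *recs, size_t n)
    {
        int handled = 0;

        for (size_t i = 0; i < n; i++) {
            int probe_data = 0;

            if (recs[i].probe(&recs[i], &probe_data) == 0)
                continue;
            recs[i].handler(&recs[i], probe_data);
            handled++;
        }
        return handled;
    }

The point of the split is that the probe stays cheap and free of side effects,
while the handler runs only for records that actually reported an error.
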
The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments; that structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform writing
its own. Both helpers above:

- Return non-zero value when an error is detected in a Standard Error Record;

- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;

- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);

- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically the ``NMEA`` and ``EASE`` bits in the
``SCR_EL3`` register. These were introduced to assist EL3 software which runs
part of its entry/exit routines with exceptions momentarily masked. In such
systems, External Aborts/SErrors are not immediately handled when they occur,
but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all EL3 code with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A| is
thus able to detect a Double Fault condition in software, without needing the
Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler,
which does not return.

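One way to picture software detection of a Double Fault is a flag that stays
set for the duration of error handling. The sketch below assumes that model
purely for illustration: ``handling_error``, ``on_error()`` and
``on_double_fault()`` are hypothetical names, and the actual |TF-A| entry code
and platform hook are not shown here.

.. code:: c

    #include <stdbool.h>

    bool handling_error;  /* hypothetical: true while an error is being handled */
    int double_faults;    /* counts calls to the stand-in hook below */

    /* Stand-in for the platform double fault handler; a real one never returns. */
    void on_double_fault(void)
    {
        double_faults++;
    }

    /*
     * Entry point for a newly signalled error. Passing nested = true simulates
     * a second error arriving while the first is still being handled.
     */
    void on_error(bool nested)
    {
        if (handling_error) {
            on_double_fault();
            return;  /* reachable only because the stand-in hook returns */
        }

        handling_error = true;
        if (nested)
            on_error(false);  /* second error arrives mid-handling */
        handling_error = false;
    }

Because exceptions stay unmasked, the second error re-enters the handler at
once, so a check on entry is enough to notice the overlap.
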
Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
  `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler()`` is responsible for
iterating through the platform-supplied error records, probing them, and, when
an error is identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

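A platform override might be shaped roughly as follows. This is a sketch under
stated assumptions rather than the framework's code: the parameter list is
assumed, ``ras_ea_handler()`` is replaced by a local stub so that the example
builds on its own, a non-zero return is assumed to mean "handled", and
``stub_result`` and ``unhandled_aborts`` are hypothetical. Consult the porting
guide for the authoritative prototypes.

.. code:: c

    #include <stddef.h>
    #include <stdint.h>

    int stub_result;       /* hypothetical: value returned by the stub below */
    int unhandled_aborts;  /* hypothetical: counts aborts left to the platform */

    /* Local stub standing in for the framework's top-level RAS handler. */
    int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                       void *handle, uint64_t flags)
    {
        (void)ea_reason;
        (void)syndrome;
        (void)cookie;
        (void)handle;
        (void)flags;
        return stub_result;
    }

    /* Platform override: let the RAS framework try first. */
    void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                         void *handle, uint64_t flags)
    {
        if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
            return;  /* a registered error record handler dealt with it */

        /* Platform-specific fallback handling would go here. */
        unhandled_aborts++;
    }
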
Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

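The bisection relies on the array being sorted by interrupt number. A minimal
sketch of such a lookup is shown below; ``struct irq_entry`` and ``find_irq()``
are hypothetical stand-ins, not the framework's own types or functions, and
``record_id`` merely takes the place of the error record pointer.

.. code:: c

    #include <stddef.h>

    /* Hypothetical, simplified stand-in for a RAS interrupt descriptor. */
    struct irq_entry {
        unsigned int intr_number;  /* the array is sorted on this field */
        int record_id;             /* stands in for the error record pointer */
    };

    /*
     * Binary search over an array sorted by increasing intr_number.
     * Returns the matching entry, or NULL if the interrupt is not registered.
     */
    const struct irq_entry *find_irq(const struct irq_entry *tbl, size_t n,
                                     unsigned int intr)
    {
        size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (tbl[mid].intr_number == intr)
                return &tbl[mid];
            if (tbl[mid].intr_number < intr)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

If the array were left unsorted, a search like this could miss registered
interrupts, which is why the ordering requirement is placed on the platform.
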
Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, priority management is implicit; for
non-interrupt exceptions, it is explicit, using the :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*