Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and terminology.

**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
present, then FEAT_IESB is also implemented.

From the Non-secure world's point of view, there are two philosophies for
handling RAS errors:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

EAs and error interrupts corresponding to NS nodes are handled first in firmware:

- Errors are signaled back to the NS world via a suitable mechanism
- The kernel is prohibited from accessing the RAS error records directly
- Firmware creates CPER records for the kernel to navigate and process
- Firmware signals the error back to the kernel via SDEI

Overview
--------

FFH works in conjunction with the `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. These errors are Synchronous
External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault Handling
and Error Recovery interrupts.

The RAS framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from or attributed to the NS world are handled first in NS, and the kernel
navigates the standard error records directly.

- KFH is the default handling mode if the platform does not explicitly enable FFH mode.
- KFH mode does not need any EL3 involvement except for the reflection of errors back
  to the lower EL. This happens when there is an error (EA) in the system which is not yet
  signaled to the PE while executing at the lower EL. During entry into EL3, the errors (EA)
  are synchronized, causing the async EA to pend at EL3.

Error Synchronization at EL3 entry
==================================

During entry to EL3 from a lower EL, any pending async EAs are either
reflected back to the lower EL (KFH) or handled in EL3 itself (FFH).

|Image 1|

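
The choice made at EL3 entry can be summarised in a few lines of C. This is
only an illustrative sketch and not TF-A code: the enum, the function name
``el3_entry_sync_action`` and its two boolean inputs are hypothetical stand-ins
for "an async EA is pending after synchronization" and "the platform has
enabled FFH".

.. code:: c

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical outcome of synchronizing errors on entry to EL3. */
    enum ea_action {
        EA_NONE,                /* nothing pending, continue normally */
        EA_REFLECT_TO_LOWER_EL, /* KFH: hand the EA back to the lower EL */
        EA_HANDLE_IN_EL3        /* FFH: handle the EA in EL3 itself */
    };

    /* Decide what to do with a pending async EA found at EL3 entry. */
    static enum ea_action el3_entry_sync_action(bool async_ea_pending,
                                                bool ffh_enabled)
    {
        if (!async_ea_pending) {
            return EA_NONE;
        }

        return ffh_enabled ? EA_HANDLE_IN_EL3 : EA_REFLECT_TO_LOWER_EL;
    }

With KFH being the default mode, the reflect path is the one taken unless the
platform opts in to FFH.
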
TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

TF-A Tests
==========

RAS functionality is regularly tested in TF-A CI using the `RAS test group`_, which has multiple
configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as the NS-EL2 payload.

- **FFH without RAS extension**

  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*

  A couple of tests, one each for sync EA and async EA from a lower EL, which get handled in EL3.
  They inject External Aborts (sync/async) which trap to EL3. FVP has a handler which gracefully
  handles these errors and returns to TF-A Tests.

  Build Configs: **HANDLE_EA_EL3_FIRST_NS**, **PLATFORM_TEST_EA_FFH**

- **FFH with RAS extension**

  Three tests:

  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject an unrecoverable RAS error, which gets handled in EL3.

  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*

    Inject uncontainable RAS errors, which cause the platform to panic.

  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*

    Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a lower EL
    which remains pending until EL3 is entered through an SMC call. On encountering the pending
    async EA at EL3 entry, EL3 handles the async EA first (nested exception) before handling the
    original SMC call.

- **KFH with RAS extension**

  A couple of tests in the group:

  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*

    Reflection of synchronized errors from EL3 to TF-A Tests: two tests, one each for reflecting
    in the IRQ and SMC paths.

RAS Framework
=============

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a system
  register-accessed record, the start index of the record and the number of
  consecutive records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified through
one of these mechanisms, the RAS framework can iterate through and probe error
records for errors, and invoke the appropriate handler to handle them.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its arguments,
which is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

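
As a rough illustration of the probe contract, the sketch below scans a group
of simulated records and reports the index of the first one flagged as valid.
It is written to compile on its own, so ``struct err_record_info`` here is a
simplified stand-in with hypothetical fields (``status``, ``num_records``), and
``SIM_STATUS_VALID`` and ``example_probe`` are likewise made up for the
example. A real probe would read the node's actual registers.

.. code:: c

    #include <assert.h>
    #include <stdint.h>

    /* Simplified stand-in for the framework structure (hypothetical fields). */
    struct err_record_info {
        const uint64_t *status;   /* simulated per-record status words */
        unsigned int num_records; /* number of records in this group */
    };

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

    /* Hypothetical "error valid" flag in a simulated status word. */
    #define SIM_STATUS_VALID (1ULL << 30)

    /*
     * Return non-zero and store the record index in *probe_data when a record
     * reports an error; return 0 and leave *probe_data untouched otherwise.
     */
    static int example_probe(const struct err_record_info *info, int *probe_data)
    {
        for (unsigned int i = 0; i < info->num_records; i++) {
            if ((info->status[i] & SIM_STATUS_VALID) != 0U) {
                *probe_data = (int)i;
                return 1;
            }
        }

        return 0;
    }

The value written to ``probe_data`` is what the error handler later receives
as its ``probe_data`` argument.
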
The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and register
it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.

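
A platform file following this pattern might look roughly like the sketch
below. To keep it compilable outside |TF-A|, the structure and the three
macros are redefined here as simplified stand-ins: only their names and
argument lists come from this document, while the field names, macro bodies,
the ``plat_*`` identifiers, and the base address are assumptions made for
illustration.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    struct err_record_info;
    struct err_handler_data;

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);
    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

    /* Simplified stand-in; the real structure is defined by the framework. */
    struct err_record_info {
        int memmap;              /* 1: memory-mapped, 0: system register */
        uintptr_t base_addr;     /* memory-mapped: base address */
        unsigned int size_num_k; /* memory-mapped: size in KB */
        unsigned int idx_start;  /* sysreg: first record index */
        unsigned int num_idx;    /* sysreg: number of records */
        err_record_probe_t probe;
        err_record_handler_t handler;
        const void *aux_data;
    };

    /* Stand-in macro bodies so that the sketch builds on its own. */
    #define ERR_RECORD_MEMMAP_V1(base, size_k, p, h, aux) \
        { .memmap = 1, .base_addr = (base), .size_num_k = (size_k), \
          .probe = (p), .handler = (h), .aux_data = (aux) }
    #define ERR_RECORD_SYSREG_V1(start, num, p, h, aux) \
        { .memmap = 0, .idx_start = (start), .num_idx = (num), \
          .probe = (p), .handler = (h), .aux_data = (aux) }
    #define REGISTER_ERR_RECORD_INFO(arr) \
        static const size_t num_err_records = sizeof(arr) / sizeof((arr)[0])

    /* Hypothetical platform probe: reports "no error". */
    static int plat_probe(const struct err_record_info *info, int *probe_data)
    {
        (void)info;
        (void)probe_data;
        return 0;
    }

    /* Hypothetical platform handler: nothing to do in this sketch. */
    static int plat_handler(const struct err_record_info *info, int probe_data,
                    const struct err_handler_data *const data)
    {
        (void)info;
        (void)probe_data;
        (void)data;
        return 0;
    }

    /* One system register group and one memory-mapped node (made-up address). */
    static struct err_record_info plat_err_records[] = {
        ERR_RECORD_SYSREG_V1(0, 2, plat_probe, plat_handler, NULL),
        ERR_RECORD_MEMMAP_V1(0x2b000000UL, 4, plat_probe, plat_handler, NULL),
    };

    REGISTER_ERR_RECORD_INFO(plat_err_records);

In this stand-in, the registration macro derives the element count with
``sizeof``, which only works where the full array definition is visible. That
is one plausible reason for the same-file requirement, although the real macro
may differ.
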
Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform writing
its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type ``struct
ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service RAS
interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not immediately
handled when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of its EL3 code with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect a Double Fault condition in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler and
do not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

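
The probe-then-handle iteration can be pictured with the following
self-contained sketch. It is not the framework's implementation:
``sketch_probe_and_handle``, the reduced ``struct err_record_info`` and
``struct err_handler_data``, and the three demo callbacks are hypothetical, and
the handler's return value is ignored because its convention is not described
here.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    struct err_record_info;

    /* Stand-in; the real structure carries the error reason, syndrome, etc. */
    struct err_handler_data {
        unsigned int ea_reason;
    };

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);
    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

    /* Reduced stand-in holding only the two callbacks used below. */
    struct err_record_info {
        err_record_probe_t probe;
        err_record_handler_t handler;
    };

    /* Probe every record; invoke the handler of each one reporting an error. */
    static int sketch_probe_and_handle(const struct err_record_info *records,
                    size_t num_records, const struct err_handler_data *data)
    {
        int handled = 0;

        for (size_t i = 0; i < num_records; i++) {
            const struct err_record_info *info = &records[i];
            int probe_data = 0;

            if (info->probe(info, &probe_data) == 0) {
                continue; /* no error in this record group */
            }

            (void)info->handler(info, probe_data, data);
            handled = 1;
        }

        return handled; /* 1 if at least one error was handled */
    }

    /* Demo callbacks (hypothetical), used only to exercise the loop. */
    static int last_probe_data = -1;

    static int probe_clean(const struct err_record_info *info, int *probe_data)
    {
        (void)info;
        (void)probe_data;
        return 0;
    }

    static int probe_faulty(const struct err_record_info *info, int *probe_data)
    {
        (void)info;
        *probe_data = 7;
        return 1;
    }

    static int record_handler(const struct err_record_info *info, int probe_data,
                    const struct err_handler_data *const data)
    {
        (void)info;
        (void)data;
        last_probe_data = probe_data;
        return 0;
    }
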
Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

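
An overriding handler could be shaped along the lines of the sketch below. The
parameter lists and the "non-zero means handled" return convention are
assumptions made for illustration and should be checked against the porting
guide. ``ras_ea_handler`` is stubbed out here so that the example builds on its
own, and ``stub_handled``, ``ras_calls``, and ``unhandled`` exist only for the
demonstration.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    static int stub_handled = 1;    /* demo knob: what the stub reports */
    static unsigned int ras_calls;  /* demo counter */
    static unsigned int unhandled;  /* demo counter */

    /* Stub standing in for the framework's top-level RAS handler (assumed API). */
    static int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                    void *cookie, void *handle, uint64_t flags)
    {
        (void)ea_reason;
        (void)syndrome;
        (void)cookie;
        (void)handle;
        (void)flags;
        ras_calls++;
        return stub_handled;
    }

    /* Platform override: give the RAS framework the first chance to handle it. */
    void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                    void *cookie, void *handle, uint64_t flags)
    {
        if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0) {
            return; /* the RAS framework dealt with the error */
        }

        /* Platform-specific fallback would go here; the sketch only counts. */
        unhandled++;
    }
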
Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

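
The lookup step is an ordinary binary search over the sorted array, roughly as
sketched below. The field names of ``struct ras_interrupt`` and the helper
``sketch_find_interrupt`` are assumptions for illustration; only the three
pieces of information held per interrupt come from this document.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    struct err_record_info; /* opaque in this sketch */

    /* Assumed layout: record pointer, optional cookie, interrupt number. */
    struct ras_interrupt {
        struct err_record_info *err_record;
        void *cookie;
        unsigned int intr_number;
    };

    /*
     * Bisect an array sorted by increasing intr_number. Returns the matching
     * entry, or NULL when the interrupt is not a registered RAS interrupt.
     */
    static const struct ras_interrupt *sketch_find_interrupt(
                    const struct ras_interrupt *ints, size_t num,
                    unsigned int intr_number)
    {
        size_t lo = 0;
        size_t hi = num; /* search window is [lo, hi) */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (ints[mid].intr_number == intr_number) {
                return &ints[mid];
            }

            if (ints[mid].intr_number < intr_number) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }

        return NULL;
    }

If the array were not sorted, a search like this could miss registered
interrupts, which is why the ordering requirement matters.
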
Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; but
for non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master

.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png