This blog covers userspace memory Uncorrectable Error (UE) handling by the Linux kernel, the userspace handler, and the test strategy on Intel/AMD platforms. Kernel memory UE handling, memory scrubbing methods, and device memory UEs are out of scope.
Suppose that during the lifetime of a mission-critical application process, a DIMM cell suddenly goes bad, and the page containing the bad cell is mapped into a userspace process. When the process reads that memory location, the UE in the DIMM is consumed, which sets off a series of events in the kernel.
When a CPU consumes an uncorrectable error, a Machine Check Exception (MCE) is raised and the CPU traps into the kernel. The Linux kernel MCE handler is then invoked to process the machine check event.
The MCE handler checks the MCG_STATUS and MCi_STATUS MSRs to determine the next action: Is the UE address valid? Is the UE address in userspace or kernel space? Is the userspace program restartable? And so on.
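To make those questions concrete, here is a minimal C sketch of how such status bits could be tested. The bit positions for MCG_STATUS (RIPV) and MCi_STATUS (VAL, UC, ADDRV, PCC) follow the Intel SDM layout. The enum ue_action, the function classify_ue(), and the in_user_mode parameter are hypothetical names made up for illustration. The kernel's real severity grading looks at more fields than this, so treat it as a simplified illustration under those assumptions, not the actual handler logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCG_STATUS: bit 0 says whether the interrupted program can be restarted. */
#define MCG_STATUS_RIPV  (1ULL << 0)

/* MCi_STATUS bits (Intel SDM layout). */
#define MCI_STATUS_VAL   (1ULL << 63)  /* the bank holds a valid error */
#define MCI_STATUS_UC    (1ULL << 61)  /* the error is uncorrected */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR holds a valid address */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context is corrupt */

/* Hypothetical outcomes, named for this sketch only. */
enum ue_action {
    UE_IGNORE,        /* no valid uncorrected error in this bank */
    UE_PANIC,         /* cannot be contained safely */
    UE_KILL_TASK,     /* user task cannot be restarted */
    UE_RECOVER_PAGE   /* isolate the bad page, task may continue */
};

/*
 * Simplified decision sketch. in_user_mode is an assumed input; a real
 * handler would derive it from the saved register state at the trap.
 */
enum ue_action classify_ue(uint64_t mcg_status, uint64_t mci_status,
                           bool in_user_mode)
{
    /* Nothing to do unless the bank logged a valid uncorrected error. */
    if (!(mci_status & MCI_STATUS_VAL) || !(mci_status & MCI_STATUS_UC))
        return UE_IGNORE;

    /* Corrupt processor context cannot be recovered from. */
    if (mci_status & MCI_STATUS_PCC)
        return UE_PANIC;

    /* Is the UE address valid? Without it the bad page is unknown. */
    if (!(mci_status & MCI_STATUS_ADDRV))
        return UE_PANIC;

    /* Userspace or kernel space? Kernel UEs are out of scope here, so
     * this sketch conservatively treats them as fatal. */
    if (!in_user_mode)
        return UE_PANIC;

    /* Is the userspace program restartable? */
    if (mcg_status & MCG_STATUS_RIPV)
        return UE_RECOVER_PAGE;

    return UE_KILL_TASK;
}
```

In this sketch, a userspace UE with a valid address and RIPV set maps to UE_RECOVER_PAGE, while the same error with RIPV clear maps to UE_KILL_TASK. Treating a missing address or a kernel-mode UE as fatal is a deliberately conservative choice for the example.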
Let’s examine the console message from the EDAC (Error Detection And Correction) driver on an X9-2 system…
