Memory UE Handling and HWpoison Injection

This blog covers how the Linux kernel and a userspace handler deal with userspace memory Uncorrectable Errors (UEs), as well as the test strategy on Intel/AMD platforms. Kernel memory UE handling, memory scrubbing methods, and device memory UEs are out of scope.

Suppose that during the lifetime of a mission-critical application process, a DIMM cell suddenly goes bad, and the page containing the bad cell is mapped into a userspace process. When that memory location is read, the UE in the DIMM is consumed, and a series of events follows in the kernel.

When a CPU consumes an uncorrectable error, a Machine Check Exception (MCE) is raised and the CPU traps into the exception handler. The Linux kernel MCE handler is then invoked to process the machine check event.

The MCE handler checks the MCG_STATUS and MCi_STATUS MSRs to determine the next action: Is the UE address valid? Is the UE address in userspace or kernel space? Is the userspace program restartable? And so on.
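
To make those questions concrete, here is a minimal Python sketch that decodes a few of the relevant status bits. It is only an illustration, not the kernel's code: the bit positions follow the Intel SDM layout for IA32_MCG_STATUS and IA32_MCi_STATUS (AMD assigns some bits differently), the S and AR bits are assumed to be meaningful (i.e. the CPU supports software error recovery), and the function names classify_ue, address_valid, and restartable are hypothetical helpers made up for this example. The real kernel handler applies a much more elaborate set of severity rules.

```python
# Illustrative decoder for machine check status MSR values.
# Bit positions follow the Intel SDM; AMD uses some bits differently.

MCG_STATUS_RIPV = 1 << 0    # restart IP valid: interrupted program can resume
MCG_STATUS_EIPV = 1 << 1    # error IP valid
MCG_STATUS_MCIP = 1 << 2    # machine check in progress

MCI_STATUS_VAL = 1 << 63    # this bank holds a valid error record
MCI_STATUS_UC = 1 << 61     # error was not corrected by hardware
MCI_STATUS_ADDRV = 1 << 58  # MCi_ADDR holds a valid error address
MCI_STATUS_PCC = 1 << 57    # processor context corrupt
MCI_STATUS_S = 1 << 56      # signaled via machine check exception
MCI_STATUS_AR = 1 << 55     # software recovery action required


def classify_ue(mci_status: int) -> str:
    """Roughly classify one MCi_STATUS value (hypothetical helper)."""
    if not mci_status & MCI_STATUS_VAL:
        return "invalid"
    if not mci_status & MCI_STATUS_UC:
        return "corrected"
    if mci_status & MCI_STATUS_PCC:
        return "fatal"  # context is corrupt, no recovery possible
    if mci_status & MCI_STATUS_S:
        # SRAR: action required; SRAO: action optional
        return "SRAR" if mci_status & MCI_STATUS_AR else "SRAO"
    return "UCNA"  # uncorrected, no action required


def address_valid(mci_status: int) -> bool:
    """Is the UE address valid? (hypothetical helper)"""
    wanted = MCI_STATUS_VAL | MCI_STATUS_ADDRV
    return mci_status & wanted == wanted


def restartable(mcg_status: int) -> bool:
    """Can the interrupted program be restarted? (hypothetical helper)"""
    return bool(mcg_status & MCG_STATUS_RIPV)


if __name__ == "__main__":
    # Made-up sample values, not taken from a real machine.
    sample_mci = (MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_ADDRV
                  | MCI_STATUS_S | MCI_STATUS_AR)
    sample_mcg = MCG_STATUS_RIPV | MCG_STATUS_MCIP
    print(classify_ue(sample_mci), address_valid(sample_mci),
          restartable(sample_mcg))
```

A consumed UE in user memory, the scenario described above, would typically show up as the SRAR case. Whether the address belongs to userspace or kernel space cannot be read from these bits alone; that decision needs the address and the context in which the exception occurred.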

Let’s examine the console message from the EDAC (Error Detection And Correction) driver on an X9-2 system…

Read Details Here.


