A strange thing attracted my attention when I was going through a disassembled KiExecuteAllDpcs. There was no memory fence or barrier on a path that synchronizes DPC insertion in the CPU's DPC list and DPC execution by KiExecuteAllDpcs which might create some problems on high performance CPUs with out-of-order memory read access.
The DPC insertion is synchronized via DpcData member of the KDPC structure. Before a DPC is inserted in the list it is set to non zero value by interlocked operation, so when a DPC is removed from a list it is set to zero again. If DpcData is non zero a DPC had been already inserted in a CPU core's DPC list and KeInsertQueueDpc does nothing,. You should understand that this might be another CPU core's DPC list as each CPU core has its own DPC list in per CPU core's KiProcessorBlock.
Lets look on a disassembled KeInsertQueueDpc and KiExecuteAllDpcs for 64 bit Windows 8. We need to find where DpcData is changed. I took an easy way out by setting a breakpoint on KeInsertQueueDpc, then set a breakpoint on memory write to DpcData for a DPC that is provided as a parameter in RCX register.
So, I started with a breakpoint on KeInsertQueueDpc
kd> bp nt!KeInsertQueueDpc
The breakpoint was hit
kd> g
Breakpoint 0 hit
nt!KeInsertQueueDpc:
fffff802`d34f6420 4889542410 mov qword ptr [rsp+10h],rdx
The DPC address is the first parameter for KeInsertQueueDpc, so it was in RCX register
kd> r rcx
rcx=ffffe00002385830
Lets look on the _KDPC structure stored in a kernel symbols file
+0x000 TargetInfoAsUlong : Uint4B
+0x000 Type : UChar
+0x001 Importance : UChar
+0x002 Number : Uint2B
+0x008 DpcListEntry : _SINGLE_LIST_ENTRY
+0x010 ProcessorHistory : Uint8B
+0x018 DeferredRoutine : Ptr64 void
+0x020 DeferredContext : Ptr64 Void
+0x028 SystemArgument1 : Ptr64 Void
+0x030 SystemArgument2 : Ptr64 Void
+0x038 DpcData : Ptr64 Void
The DpcData is on 7*8 == 0x38 bytes offset from the structure's beginning. Below is a DPC dump
kd> dq ffffe00002385830
ffffe000`02385830 00000000`02850113 00000000`00000000
ffffe000`02385840 00000000`00000020 fffff800`00cb5700
ffffe000`02385850 00000000`00000005 00000000`b0f66d37
ffffe000`02385860 00000000`01cf5513 00000000`00000000
ffffe000`02385870 00000001`40ee0088 ffffe000`02385878
ffffe000`02385880 ffffe000`02385878 00000000`23ba44fc
ffffe000`02385890 ffffe000`04e74950 fffff802`d3773f48
ffffe000`023858a0 11360faa`f68c8fd6 0000000a`00000000
Just FYI, this was a DPC from the TCP subsystem
kd> u fffff800`00cb5700
tcpip!TcpPeriodicTimeoutHandler:
fffff800`00cb5700 48895c2408 mov qword ptr [rsp+8],rbx
fffff800`00cb5705 4889742418 mov qword ptr [rsp+18h],rsi
fffff800`00cb570a 48897c2420 mov qword ptr [rsp+20h],rdi
fffff800`00cb570f 4889542410 mov qword ptr [rsp+10h],rdx
fffff800`00cb5714 55 push rbp
fffff800`00cb5715 4154 push r12
fffff800`00cb5717 4155 push r13
fffff800`00cb5719 4156 push r14
Next I wanted to track where DpcData would be changed, so a breakpoint on memory access was set
kd> ba w8 ffffe00002385830+8*7
kd> bl
0 d fffff802`d34f6420 0001 (0001) nt!KeInsertQueueDpc
1 e ffffe000`02385868 w 8 0001 (0001)
I did not need the breakpoint on KeInsertQueueDpc, so it was disabled.
Eventually the memory watch breakpoint was hit
kd> g
Breakpoint 1 hit
nt!KeInsertQueueDpc+0xf4:
fffff802`d34f6514 753a jne nt!KeInsertQueueDpc+0x130 (fffff802`d34f6550)
Lets check the IRQL
kd> !irql
Debugger saved IRQL for processor 0x0 -- 15 (HIGH_LEVEL)
This is the highest level for 64 bit x86 architecture, so the scheduler and all interrupts were disabled for the current CPU core.
Ok, I found a place where DpcData was changed by InterlockedCompareExchangePointer to a nonzero value. Lets look at a couple of instructions before and after this point. The InterlockedCompareExchangePointer is implemented by the lock cmpxchg instruction that represents a memory barrier by itself, so a compiler do not rearrange instructions around this call and a CPU retires all instruction before a barrier and do not allow out-of-order execution crossing the barrier.
nt!KeInsertQueueDpc+0xf4
fffff802`d34f6504 443b4d24 cmp r9d,dword ptr [rbp+24h]
fffff802`d34f6508 480f45c8 cmovne rcx,rax
fffff802`d34f650c 33c0 xor eax,eax
fffff802`d34f650e f0480fb14e38 lock cmpxchg qword ptr [rsi+38h],rcx <======= here!
fffff802`d34f6514 753a jne nt!KeInsertQueueDpc+0x130 (fffff802`d34f6550)
fffff802`d34f6516 ff4718 inc dword ptr [rdi+18h]
fffff802`d34f6519 ff471c inc dword ptr [rdi+1Ch]
fffff802`d34f651c 48895628 mov qword ptr [rsi+28h],rdx
fffff802`d34f6520 4c896e30 mov qword ptr [rsi+30h],r13
fffff802`d34f6524 4584e4 test r12b,r12b
Execution was continued to find a place where the DpcData would be changed back to zero, here it is,
kd> g
Breakpoint 1 hit
nt!KiExecuteAllDpcs+0x10d:
fffff802`d34dbc6d ff4b18 dec dword ptr [rbx+18h]
As you might know WinDBG points to the next instruction. Lets look around this point.
nt!KiExecuteAllDpcs+0x10d:
fffff802`d34dbc59 4c8b5720 mov r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728 mov r8,qword ptr [rdi+28h]
fffff802`d34dbc61 4c8b4f30 mov r9,qword ptr [rdi+30h]
fffff802`d34dbc65 4c8b6f38 mov r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738 mov qword ptr [rdi+38h],rsi <======== here!
fffff802`d34dbc6d ff4b18 dec dword ptr [rbx+18h] ds:002b:ffffd000`207bbf18=00000001
check that the RSI registry was zero as it was written in DpcData
kd> r rsi
rsi=0000000000000000
and again check the IRQL
kd> !irql
Debugger saved IRQL for processor 0x5 -- 2 (DISPATCH_LEVEL)
This was a DPC retirement by an idle thread, pretty common for a not busy system
kd> kn
# Child-SP RetAddr Call Site
00 ffffd000`207e39a0 fffff802`d34db9f0 nt!KiExecuteAllDpcs+0x10d
01 ffffd000`207e3af0 fffff802`d35d27ea nt!KiRetireDpcList+0xd0
02 ffffd000`207e3c60 00000000`00000000 nt!KiIdleLoop+0x5a
Now it is time to digest the information. If the above assembler instructions are being translated to a high level language like C we would have
BOOLEAN
KeInsertQueueDpc (
__inout PRKDPC Dpc,
__in_opt PVOID SystemArgument1,
__in_opt PVOID SystemArgument2
)
{
// prevent any interrupt on this CPU core
KeRaiseIrql(HIGH_LEVEL, &OldIrql);
.....
// a per CPU core lock for DPC queue
KiAcquireSpinLock(&PerCpuDpcData->DpcLock);
.....
// acquire the DPC by changing DpcData, interlocked functions pose a memory barrier
if (InterlockedCompareExchangePointer(&Dpc->DpcData, PerCpuDpcData, NULL) == NULL) {
.......
/*
fffff802`d34f651c 48895628 mov qword ptr [rsi+28h],rdx
fffff802`d34f6520 4c896e30 mov qword ptr [rsi+30h],r13
*/
Dpc->SystemArgument1 = SystemArgument1;
Dpc->SystemArgument2 = SystemArgument2;
// insert in the list
InsertHeadList(&PerCpuDpcData->DpcListHead, &Dpc->DpcListEntry);
} else {
// do nothing
}
KiReleaseSpinLock(&PerCpuDpcData->DpcLock);
KeLowerIrql(OldIrql);
}
VOID
KiExecuteAllDpcs()
{
// a per CPU core lock for DPC queue
KeAcquireSpinLockAtDpcLevel(&PerCpuDpcData->DpcLock);
......
RemoveEntryList(Entry);
Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);
......
/*
fffff802`d34dbc59 4c8b5720 mov r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728 mov r8,qword ptr [rdi+28h]
*/
SystemArgument1 = Dpc->SystemArgument1;
SystemArgument2 = Dpc->SystemArgument2;
/*
fffff802`d34dbc65 4c8b6f38 mov r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738 mov qword ptr [rdi+38h],rsi
*/
Dpc->DpcData = NULL;
KeReleaseSpinLockFromDpcLevel(&PerCpuDpcData->DpcLock);
}
Lets discuss what we have at this point. Actually, the synchronisation between KeInsertQueueDpc and KiExecuteAllDpcs does not look flawless. The problem is in out-of-order data access and compiler optimisation. In the current case the compiler did not rearrange memory access in KiExecuteAllDpcs but nothing prevents it in the future to schedule Dpc->DpcData = NULL before SystemArgument1 = Dpc->SystemArgument1; or SystemArgument2 = Dpc->SystemArgument2; or both as there is no data or control dependency between them. The situation is exacerbated by modern CPUs out-of-order memory access for reading, it is legitimate for a CPU to write NULL in Dpc->DpcData before reading Dpc->SystemArgument1 or Dpc->SystemArgument2 as again there is no data or control dependency here and it is possible that the instructions are retired out of order. In that case if there is a concurrent KeInsertQueueDpc on another CPU core there is a probability ( though extremely low ) that KiExecuteAllDpcs picks a wrong SystemArgument1 or SystemArgument2 that has been just rewritten by KeInsertQueueDpc after InterlockedCompareExchangePointer successfully changed DpcData to a non zero value. Below is a table of memory reordering in some architectures, IA 64 and ARM look pretty scary for this case, Microsoft is lucky at least by dropping IA 64 support.
Type | Alpha | ARMv7 | PA-RISC | POWER | SPARC RMO | SPARC PSO | SPARC TSO | x86 | x86 oostore | AMD64 | IA-64 | zSeries |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Loads reordered after loads | Y | Y | Y | Y | Y | Y | Y | |||||
Loads reordered after stores | Y | Y | Y | Y | Y | Y | Y | |||||
Stores reordered after stores | Y | Y | Y | Y | Y | Y | Y | Y | ||||
Stores reordered after loads | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
Atomic reordered with loads | Y | Y | Y | Y | Y | |||||||
Atomic reordered with stores | Y | Y | Y | Y | Y | Y | ||||||
Dependent loads reordered | Y | |||||||||||
Incoherent instruction cache pipeline | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
The corrected code should use some type of memory fence or barrier, like the code below. I hope Microsoft did this for ARM, but in any case this should have been done for x86, x86-64 and IA 64, especially the last one.
VOID
KiExecuteAllDpcs()
{
// a per CPU core lock for DPC queue
KeAcquireSpinLockAtDpcLevel(&PerCpuDpcData->DpcLock);
......
RemoveEntryList(Entry);
Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);
......
/*
fffff802`d34dbc59 4c8b5720 mov r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728 mov r8,qword ptr [rdi+28h]
*/
SystemArgument1 = Dpc->SystemArgument1;
SystemArgument2 = Dpc->SystemArgument2;
KeMemoryBarrier();
/*
fffff802`d34dbc65 4c8b6f38 mov r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738 mov qword ptr [rdi+38h],rsi
*/
Dpc->DpcData = NULL;
KeReleaseSpinLockFromDpcLevel(&PerCpuDpcData->DpcLock);
}
The rule for such synchronization is in employing a memory barrier when a resource is being released if a resource acquisition was made by an interlocked operation.
Acquire( SomeValue )
{
// interlocked operations is a barrier for both a CPU and a compiler
if( InterlockedCompareExchange( &Data->Resource, 0x1, 0x0 ) == 0 ){
// the resource has been acquired
Data->Value = SomeValue;
}
}
.........
Release()
{
// get the value when the resource is held
StackedValue = Data->Value;
// tell a CPU to retire all preceding operations, this is also a compiler barrier
KeMemoryBarrier();
// release the resource
Data->Resource = 0x0;
// Do something with the local StackedValue
foo( StackedValue );
}
P.S. If you look at the Windows NT open source clone - ReactOS the situation there is not better, look at KiRetireDpcList ( there is no KiExecuteAllDpcs which was introduced somewhere in Vista or 8 )
/* Clear its DPC data and save its parameters */
KiExecuteAllDpcs()
{
// a per CPU core lock for DPC queue
KeAcquireSpinLockAtDpcLevel(&PerCpuDpcData->DpcLock);
......
RemoveEntryList(Entry);
Dpc = CONTAINING_RECORD(Entry, KDPC, DpcListEntry);
......
/*
fffff802`d34dbc59 4c8b5720 mov r10,qword ptr [rdi+20h]
fffff802`d34dbc5d 4c8b4728 mov r8,qword ptr [rdi+28h]
*/
SystemArgument1 = Dpc->SystemArgument1;
SystemArgument2 = Dpc->SystemArgument2;
KeMemoryBarrier();
/*
fffff802`d34dbc65 4c8b6f38 mov r13,qword ptr [rdi+38h]
fffff802`d34dbc69 48897738 mov qword ptr [rdi+38h],rsi
*/
Dpc->DpcData = NULL;
KeReleaseSpinLockFromDpcLevel(&PerCpuDpcData->DpcLock);
}
The rule for such synchronization is in employing a memory barrier when a resource is being released if a resource acquisition was made by an interlocked operation.
Acquire( SomeValue )
{
// interlocked operations is a barrier for both a CPU and a compiler
if( InterlockedCompareExchange( &Data->Resource, 0x1, 0x0 ) == 0 ){
// the resource has been acquired
Data->Value = SomeValue;
}
}
.........
Release()
{
// get the value when the resource is held
StackedValue = Data->Value;
// tell a CPU to retire all preceding operations, this is also a compiler barrier
KeMemoryBarrier();
// release the resource
Data->Resource = 0x0;
// Do something with the local StackedValue
foo( StackedValue );
}
P.S. If you look at the Windows NT open source clone - ReactOS the situation there is not better, look at KiRetireDpcList ( there is no KiExecuteAllDpcs which was introduced somewhere in Vista or 8 )
/* Clear its DPC data and save its parameters */
Dpc->DpcData = NULL; DeferredRoutine = Dpc->DeferredRoutine; DeferredContext = Dpc->DeferredContext; SystemArgument1 = Dpc->SystemArgument1; SystemArgument2 = Dpc->SystemArgument2;after Dpc->DpcData = NULL a concurent KeInsertQueueDpc starts writing new values to Dpc->SystemArgument1 and Dpc->SystemArgument2 before KiRetireDpcList has fetched their consistent values. Without memory barrier there is a room for compiler optimisation by rearranging store and loads and for CPU optimisation.