Showing posts with label Memory Manager. Show all posts

Saturday, July 29, 2017

Windows developers' misconception about UNIX.

While reading the osronline.com forum on Windows file system development, I ran into a misconception about UNIX design that is common among Windows developers: http://osronline.com/cf.cfm?PageURL=showThread.CFM?link=285260
<QUOTE>
The essential difference between how the NT kernel works and how Unix was
designed is that NT caches streams of data (above the file system), whereas
on Unix data is cached at the block layer.
</QUOTE>
It took me five minutes to bust it.
This is true only for ancient *NIX kernels. Modern kernels use the same technique as NT: the cache is backed by file mapping structures.
For example, below is a call stack from my test machine running Linux kernel 4.12.2. While processing the read() system call, the ext4 read operation (ext4_file_read_iter) called the "Linux cache manager" (do_generic_file_read -> page_cache_sync_readahead) to bring data into the cache, which is backed by the mapped file structure (struct address_space).
The cache code then called back into the file system through mapping->a_ops->readpages, which ended up in ext4_readpages. This is an analogue of a cached read in NT. Mac OS X uses the same caching-by-file-mapping technique, borrowed from BSD.
(gdb) bt
#0  ext4_readpages (file=0xffff88001d59b300, mapping=0xffff88001d1d56c0, pages=0xffffc90000817c30, nr_pages=1) at ../fs/ext4/inode.c:3308
#1  0xffffffff811b6288 in read_pages (gfp=<optimised out>, nr_pages=<optimised out>, pages=<optimised out>, filp=<optimised out>, mapping=<optimised out>) at ../mm/readahead.c:121
#2  __do_page_cache_readahead (mapping=<optimised out>, filp=<optimised out>, offset=1, nr_to_read=<optimised out>, lookahead_size=<optimised out>) at ../mm/readahead.c:199
#3  0xffffffff811b64b8 in ra_submit (ra=<optimised out>, ra=<optimised out>, ra=<optimised out>, filp=<optimised out>, mapping=<optimised out>) at ../mm/internal.h:66
#4  ondemand_readahead (mapping=0xffff88001d1d56c0, ra=0xffff88001d59b398, filp=0xffff88001d59b300, hit_readahead_marker=<optimised out>, offset=0, req_size=<optimised out>) at ../mm/readahead.c:478
#5  0xffffffff811b678e in page_cache_sync_readahead (mapping=<optimised out>, ra=<optimised out>, filp=<optimised out>, offset=<optimised out>, req_size=<optimised out>) at ../mm/readahead.c:510
#6  0xffffffff811a7a62 in do_generic_file_read (written=<optimised out>, iter=<optimised out>, ppos=<optimised out>, filp=<optimised out>) at ../mm/filemap.c:1813
#7  generic_file_read_iter (iocb=0x20000, iter=<optimised out>) at ../mm/filemap.c:2069
#8  0xffffffff812d1386 in ext4_file_read_iter (iocb=0xffff88001d59b300, to=0xffff88001d1d56c0) at ../fs/ext4/file.c:70
#9  0xffffffff81237680 in call_read_iter (file=<optimised out>, iter=<optimised out>, kio=<optimised out>) at ../include/linux/fs.h:1728
#10 new_sync_read (ppos=<optimised out>, len=<optimised out>, buf=<optimised out>, filp=<optimised out>) at ../fs/read_write.c:440
#11 __vfs_read (file=0xffff88001d59b300, buf=<optimised out>, count=<optimised out>, pos=0xffffc90000817f18) at ../fs/read_write.c:452
#12 0xffffffff81237cc3 in vfs_read (file=0xffff88001d59b300, buf=0x7fb92a0cb000 <error: Cannot access memory at address 0x7fb92a0cb000>, count=<optimised out>, pos=0xffffc90000817f18)
    at ../fs/read_write.c:473
#13 0xffffffff81239385 in SYSC_read (count=<optimised out>, buf=<optimised out>, fd=<optimised out>) at ../fs/read_write.c:589
#14 SyS_read (fd=<optimised out>, buf=140433251151872, count=131072) at ../fs/read_write.c:582
#15 0xffffffff818aaffb in entry_SYSCALL_64 () at ../arch/x86/entry/entry_64.S:203

(gdb) f 4
#4  ondemand_readahead (mapping=0xffff88001d1d56c0, ra=0xffff88001d59b398, filp=0xffff88001d59b300, hit_readahead_marker=<optimised out>, offset=0, req_size=<optimised out>) at ../mm/readahead.c:478
478  return ra_submit(ra, mapping, filp);

(gdb) p/x *mapping
$14 = {host = 0xffff88001d1d5548, page_tree = {gfp_mask = 0x1180020, rnode = 0x0}, tree_lock = {{rlock = {raw_lock = {val = {counter = 0x0}}}}}, i_mmap_writable = {counter = 0x0}, i_mmap = {
    rb_node = 0x0}, i_mmap_rwsem = {count = {counter = 0x0}, wait_list = {next = 0xffff88001d1d56f0, prev = 0xffff88001d1d56f0}, wait_lock = {raw_lock = {val = {counter = 0x0}}}, osq = {tail = {
        counter = 0x0}}, owner = 0x0}, nrpages = 0x0, nrexceptional = 0x0, writeback_index = 0x0, a_ops = 0xffffffff81a3a680, flags = 0x0, private_lock = {{rlock = {raw_lock = {val = {
            counter = 0x0}}}}}, gfp_mask = 0x14200ca, private_list = {next = 0xffff88001d1d5740, prev = 0xffff88001d1d5740}, private_data = 0x0}
            
(gdb) ptype mapping
type = struct address_space {
    struct inode *host;
    struct radix_tree_root page_tree;
    spinlock_t tree_lock;
    atomic_t i_mmap_writable;
    struct rb_root i_mmap;
    struct rw_semaphore i_mmap_rwsem;
    unsigned long nrpages;
    unsigned long nrexceptional;
    unsigned long writeback_index;
    const struct address_space_operations *a_ops;
    unsigned long flags;
    spinlock_t private_lock;
    gfp_t gfp_mask;
    struct list_head private_list;
    void *private_data;
} *

(gdb) f 1
#1  0xffffffff811b6288 in read_pages (gfp=<optimised out>, nr_pages=<optimised out>, pages=<optimised out>, filp=<optimised out>, mapping=<optimised out>) at ../mm/readahead.c:121
121   ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
(gdb) l
116  int ret;
117 
118  blk_start_plug(&plug);
119 
120  if (mapping->a_ops->readpages) {
121   ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
122   /* Clean up the remaining pages */
123   put_pages_list(pages);
124   goto out;
125  }

(gdb) f 9
#9  0xffffffff81237680 in call_read_iter (file=<optimised out>, iter=<optimised out>, kio=<optimised out>) at ../include/linux/fs.h:1728
1728  return file->f_op->read_iter(kio, iter);
(gdb) l
1723 } ____cacheline_aligned;
1724 
1725 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
1726          struct iov_iter *iter)
1727 {
1728  return file->f_op->read_iter(kio, iter);
1729 }
1730 
1731 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
1732           struct iov_iter *iter)
(gdb) 
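
One consequence of caching file data per file (in the address_space) rather than per disk block can be observed from user space: a shared mapping and read() see the same cached pages. The sketch below is only an illustration, assuming Linux or another system with a unified page cache; the function name mapped_write_visible_to_read and the temporary file path are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a byte stored through a MAP_SHARED mapping is seen by a
 * later pread() on the same file, 0 if it is not, -1 on a setup failure.
 */
int mapped_write_visible_to_read(void)
{
    char path[] = "/tmp/pagecache-demo-XXXXXX";   /* hypothetical location */
    const char initial[] = "hello";
    char buf[sizeof(initial)] = {0};

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                                  /* file lives until close() */

    if (write(fd, initial, sizeof(initial)) != (ssize_t)sizeof(initial)) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, sizeof(initial), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Data written with write() is already visible through the mapping. */
    assert(map[0] == 'h');

    /* Modify the cached page through the mapping, no write() involved. */
    map[0] = 'J';

    /* Read the file back through the ordinary read path. */
    ssize_t n = pread(fd, buf, sizeof(buf), 0);

    munmap(map, sizeof(initial));
    close(fd);

    if (n != (ssize_t)sizeof(initial))
        return -1;
    return strcmp(buf, "Jello") == 0;
}
```

On a kernel with a block-level-only cache, the mapping and the read path would need extra synchronization to agree; with a per-file page cache both paths simply find the same page.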

Thursday, June 8, 2017

Windows. Cache prefetching


00 nt!KiSwapContext
01 nt!KiSwapThread
02 nt!KiCommitThreadWait
03 nt!KeWaitForSingleObject
04 nt!MiWaitForInPageComplete
05 nt!MiPfCompleteInPageSupport
06 nt!MiPfCompletePrefetchIos
07 nt!MmWaitForCacheManagerPrefetch
08 nt!CcFetchDataForRead
09 nt!CcMapAndCopyFromCache
0a nt!CcCopyReadEx
0b nt!CcCopyRead

Monday, April 3, 2017

TLB flushing call on Windows

nt!KiRetireDpcList+0xd7
nt!KxRetireDpcList+0x5 (TrapFrame @ fffff800`cc332e70)
nt!KiDispatchInterruptContinue
nt!KiDpcInterrupt+0xca (TrapFrame @ ffffd000`a9b34d90)
nt!MiFlushTbList+0x20c
nt!MiDeleteSystemPagableVm+0x4d9
nt!MiPurgeSpecialPoolPaged+0x18
nt!MmFreeSpecialPool+0x3cf
nt!ExDeferredFreePool+0x677
nt!VerifierExFreePoolWithTag+0x44

Tuesday, March 7, 2017

RISC-V Linux kernel memory initialization on boot.

This text is based on memory-initialization.md from my GitHub repo riscv-notes.
The kernel is started with virtual memory already initialized by BBL, the machine-level bootloader. A more detailed description can be found in supervisor_vm_init.md.
The kernel start offset is defined in linux/linux-4.6.2/arch/riscv/include/asm/page.h
/*
 * PAGE_OFFSET -- the first address of the first page of memory.
 * When not using MMU this corresponds to the first free page in
 * physical memory (aligned on a page boundary).
 */
#ifdef CONFIG_64BIT
#define PAGE_OFFSET     _AC(0xffffffff80000000,UL)
#else
#define PAGE_OFFSET     _AC(0xc0000000,UL)
#endif
BBL initializes virtual memory for supervisor mode, maps the Linux kernel at PAGE_OFFSET, sets the sptbr register to the physical page number of the root page table, and switches to supervisor mode with $pc set to the entry point _start. BBL does this in the enter_supervisor_mode function defined in riscv-tools/riscv-pk/machine/minit.c
void enter_supervisor_mode(void (*fn)(uintptr_t), uintptr_t stack)
{
  uintptr_t mstatus = read_csr(mstatus);
  mstatus = INSERT_FIELD(mstatus, MSTATUS_MPP, PRV_S);
  mstatus = INSERT_FIELD(mstatus, MSTATUS_MPIE, 0);
  write_csr(mstatus, mstatus);
  write_csr(mscratch, MACHINE_STACK_TOP() - MENTRY_FRAME_SIZE);
  write_csr(mepc, fn);
  write_csr(sptbr, (uintptr_t)root_page_table >> RISCV_PGSHIFT);
  asm volatile ("mv a0, %0; mv sp, %0; mret" : : "r" (stack));
  __builtin_unreachable();
}
The important difference between RISC-V and many other CPUs (e.g. x86) is that the Linux kernel's entry point is called with virtual memory already initialized by a boot loader executing in a higher privilege mode.
Memory management is initialized inside the setup_arch routine defined in linux/linux-4.6.2/arch/riscv/kernel/setup.c. Only the part of the function relevant to memory management is shown below.
void __init setup_arch(char **cmdline_p)
{
...
    init_mm.start_code = (unsigned long) _stext;
    init_mm.end_code   = (unsigned long) _etext;
    init_mm.end_data   = (unsigned long) _edata;
    init_mm.brk        = (unsigned long) _end;

    setup_bootmem();
    ....
    paging_init();
    ....
}
The _stext, _etext, _edata and _end global variables are defined in the linker script linux/linux-4.6.2/arch/riscv/kernel/vmlinux.lds.S, which defines the kernel memory layout. These variables define the borders of the kernel sections. A thorough description of linker scripts can be found here https://sourceware.org/binutils/docs/ld/Scripts.html .
The first function called is setup_bootmem
static void __init setup_bootmem(void)
{
    unsigned long ret;
    memory_block_info info;

    ret = sbi_query_memory(0, &info);
    BUG_ON(ret != 0);
    BUG_ON((info.base & ~PMD_MASK) != 0);
    BUG_ON((info.size & ~PMD_MASK) != 0);
    pr_info("Available physical memory: %ldMB\n", info.size >> 20);

    /* The kernel image is mapped at VA=PAGE_OFFSET and PA=info.base */
    va_pa_offset = PAGE_OFFSET - info.base;
    pfn_base = PFN_DOWN(info.base);

    if ((mem_size != 0) && (mem_size < info.size)) {
        memblock_enforce_memory_limit(mem_size);
        info.size = mem_size;
        pr_notice("Physical memory usage limited to %lluMB\n",
            (unsigned long long)(mem_size >> 20));
    }
    set_max_mapnr(PFN_DOWN(info.size));
    max_low_pfn = PFN_DOWN(info.base + info.size);

#ifdef CONFIG_BLK_DEV_INITRD
    setup_initrd();
#endif /* CONFIG_BLK_DEV_INITRD */

    memblock_reserve(info.base, __pa(_end) - info.base);
    reserve_boot_page_table(pfn_to_virt(csr_read(sptbr)));
    memblock_allow_resize();
}
The Linux kernel queries the available memory size in setup_bootmem through the SBI call sbi_query_memory, which results in a call to the BBL routine __sbi_query_memory. Surprisingly, this routine executes in supervisor mode: the SBI has been mapped into the supervisor virtual address space, and no ecall instruction is issued for sbi_query_memory.
uintptr_t __sbi_query_memory(uintptr_t id, memory_block_info *p)
{
  if (id == 0) {
    p->base = first_free_paddr;
    p->size = mem_size + DRAM_BASE - p->base;
    return 0;
  }

  return -1;
}
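The size arithmetic in __sbi_query_memory and the alignment checks in setup_bootmem can be tried out in isolation. This is a minimal sketch, not BBL or kernel code: DRAM_BASE and the 2 MiB PMD size are assumed values, and query_memory and is_pmd_aligned are names invented for the example (the global first_free_paddr and mem_size are passed as parameters here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DRAM_BASE 0x80000000ULL          /* assumed start of DRAM */
#define PMD_SIZE  (1ULL << 21)           /* assumed 2 MiB mid-level mapping */
#define PMD_MASK  (~(PMD_SIZE - 1))

typedef struct {
    uint64_t base;
    uint64_t size;
} memory_block_info;

/*
 * Same arithmetic as __sbi_query_memory: the block starts at the first
 * free physical address and runs to the end of DRAM.
 */
int query_memory(uint64_t id, uint64_t first_free_paddr, uint64_t mem_size,
                 memory_block_info *p)
{
    assert(p != NULL);
    if (id != 0)
        return -1;
    p->base = first_free_paddr;
    p->size = mem_size + DRAM_BASE - p->base;
    return 0;
}

/* The condition setup_bootmem enforces with BUG_ON for base and size. */
int is_pmd_aligned(uint64_t value)
{
    return (value & ~PMD_MASK) == 0;
}
```

With 128 MiB of DRAM and a first free address of 0x80400000, the reported block is 124 MiB long, and both its base and size pass the PMD alignment check.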
More about SBI can be found here https://github.com/slavaim/riscv-notes/blob/master/bbl/sbi-to-linux.md .
The kernel reserves the pages it occupies with a call to memblock_reserve(info.base, __pa(_end) - info.base); . Then a call to reserve_boot_page_table(pfn_to_virt(csr_read(sptbr))); reserves the pages occupied by the page table allocated by the bootloader, i.e. BBL. The Linux kernel finds the page table allocated and initialized by BBL by reading its physical page number from the sptbr register and converting it to a virtual address. The virtual address of the page table is also saved in init_mm.pgd, the master kernel page table. The snippet is from linux/linux-4.6.2/arch/riscv/mm/init.c
void __init paging_init(void)
{
    init_mm.pgd = (pgd_t *)pfn_to_virt(csr_read(sptbr));
  ....
}
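
The address arithmetic behind va_pa_offset, pfn_to_virt and the sptbr value can be modelled in a few lines. The sketch below is a simplified user-space model under stated assumptions: 4 KiB pages, the 64-bit PAGE_OFFSET shown earlier, and an sptbr that holds only the page number (any other bits the register may carry are ignored). All function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                       /* assumed 4 KiB pages */
#define PAGE_OFFSET 0xffffffff80000000ULL    /* 64-bit value from page.h */

/* Set once at boot, as setup_bootmem does from info.base. */
static uint64_t va_pa_offset;

void set_phys_base(uint64_t phys_base)
{
    /* The kernel image is mapped at VA = PAGE_OFFSET and PA = phys_base. */
    assert((phys_base & ((1ULL << PAGE_SHIFT) - 1)) == 0);
    va_pa_offset = PAGE_OFFSET - phys_base;
}

uint64_t pa_to_va(uint64_t pa) { return pa + va_pa_offset; }
uint64_t va_to_pa(uint64_t va) { return va - va_pa_offset; }

/* Model of pfn_to_virt: page frame number -> kernel virtual address. */
uint64_t pfn_to_va(uint64_t pfn) { return pa_to_va(pfn << PAGE_SHIFT); }

/* The value BBL stores in sptbr: root_page_table >> RISCV_PGSHIFT. */
uint64_t sptbr_for_root(uint64_t root_pa) { return root_pa >> PAGE_SHIFT; }
```

For example, with a hypothetical physical base of 0x80400000 and a root page table at physical address 0x87ffe000, sptbr holds 0x87ffe and pfn_to_va(0x87ffe) gives 0xffffffff87bfe000, the kind of value paging_init stores in init_mm.pgd.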

Tuesday, September 6, 2016

Waiting for concurrent page fault completion

An interesting call stack: a thread waits in a page fault while another thread completes paging in the data from a file.

00 nt!KiSwapContext
01 nt!KiSwapThread
02 nt!KiCommitThreadWait
03 nt!KeWaitForSingleObject
04 nt!MiWaitForCollidedFaultComplete
05 nt!MiResolveTransitionFault
06 nt!MiResolveProtoPteFault
07 nt!MiDispatchFault
08 nt!MmAccessFault
09 nt!KiPageFault
0a nt!memcpy
0b nt!CcCopyBytesToUserBuffer
0c nt!CcMapAndCopyFromCache
0d nt!CcCopyReadEx
0e nt!CcCopyRead
0f nt!FsRtlCopyRead
10 ***
11 ***
12 ***
13 nt!NtReadFile
14 nt!KiSystemServiceCopyEnd

Friday, September 2, 2016

FileObjects and SectionObjectPointer in Windows.

Just for the record.

FileObject->SectionObjectPointer is allocated and set by a file system driver, but the structure is managed by the Memory Manager (Mm). SectionObjectPointer is shared between all file objects for the same data stream.

FileObject->SectionObjectPointer->DataSectionObject and FileObject->SectionObjectPointer->ImageSectionObject contain the addresses of the ControlArea for data and for image respectively.

ControlArea deletion is synchronized by ControlArea->WaitingForDeletion and ControlArea->u.Flags.BeingDeleted. WaitingForDeletion points to a structure with a notification event and a reference counter.

All functions that might destroy a control area take SectionObjectPointer as a parameter. These functions acquire a global lock and then check that ControlArea is not NULL. If the control area exists, ControlArea->u.Flags.BeingDeleted is checked; if it is set, the function increments the reference counter and waits on the WaitingForDeletion event. The event is deleted when the last waiting thread leaves the wait and the reference counter drops to zero. A call to MiCleanSection sets SectionObjectPointer->DataSectionObject and SectionObjectPointer->ImageSectionObject to NULL. This call is synchronized with ControlArea->u.Flags.BeingDeleted.

The functions that might delete a control area include MmFlushImageSection and CcPurgeCacheSection. That means it is safe to pass SectionObjectPointer to these functions without synchronizing with file object deletion. It is even possible to call these functions with a SectionObjectPointer when all related file objects have been deleted or have had IopDeleteFile called for them, which might happen in the IRP_MJ_PNP processing path.

Friday, August 26, 2016

File mapping and FILE_OBJECT in Windows

There is a WinDBG command, !ca, that shows file mapping related information. I will show how to get the same file mapping information for a file object (FILE_OBJECT type) by accessing the structures directly.

The core of file mapping (and of file data caching, which uses file mapping) is the SEGMENT object and the CONTROL_AREA structure. The SEGMENT object contains a pointer to an array of Prototype PTEs (ProtoPTE) of _MMPTE_PROTOTYPE type. Each ProtoPTE points to the related physical page if the page is valid. When a file mapping is created, the PTEs (Page Table Entries) of the related virtual memory range have the invalid bit set and point to Prototype PTEs. When a corresponding virtual address is accessed, a page fault happens; the page fault handler follows the link to the ProtoPTE and fixes the process PTE to point to a real page. That allows all processes to share the same physical pages for the same file mapping. The physical page might need to be allocated and its data read in from the file if this has not been done before; after that the page is shared between all processes mapping the file.
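
The fault resolution described above can be illustrated with a toy model. This is only a sketch: the types proto_pte and process_pte, the frame allocator and the function names are invented for the example and do not reflect the real _MMPTE layouts or the Mm code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared, per-file entry: records which physical frame holds the page. */
typedef struct {
    int      valid;
    uint32_t frame;
} proto_pte;

/* Per-process entry: valid and pointing at a frame, or invalid and
 * pointing at the prototype PTE. */
typedef struct {
    int        valid;
    uint32_t   frame;
    proto_pte *proto;
} process_pte;

static uint32_t next_free_frame = 100;   /* toy frame allocator */
static unsigned pages_read_from_file;    /* counts simulated file reads */

/* Creating a view leaves the process PTE invalid, linked to the ProtoPTE. */
void map_view(process_pte *pte, proto_pte *proto)
{
    pte->valid = 0;
    pte->frame = 0;
    pte->proto = proto;
}

/* Page fault on an invalid process PTE: follow the link, bring the page
 * in if nobody has done so yet, then fix up the process PTE. */
uint32_t handle_fault(process_pte *pte)
{
    assert(!pte->valid && pte->proto != NULL);
    proto_pte *proto = pte->proto;
    if (!proto->valid) {
        proto->frame = next_free_frame++;
        proto->valid = 1;
        pages_read_from_file++;          /* page read in from the file once */
    }
    pte->frame = proto->frame;
    pte->valid = 1;
    return pte->frame;
}

unsigned file_reads(void) { return pages_read_from_file; }
```

Two process PTEs mapped onto the same prototype PTE end up with the same frame number after faulting, and the simulated file read happens only once.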

FILE_OBJECT has a SectionObjectPointer field which is set by a file system driver (FSD), but all its fields are initialized by the Memory Manager (Mm) and the Cache Manager (Cc). SectionObjectPointer is of _SECTION_OBJECT_POINTERS type, with the DataSectionObject field pointing to a CONTROL_AREA structure that in turn points to a SEGMENT object. CONTROL_AREA has a _SUBSECTION structure following it at the tail; all subsequent _SUBSECTION structures are linked through the NextSubsection pointer. Each _SUBSECTION has a SubsectionBase field that points to the related ProtoPTE array.
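
The chain of pointers can be sketched with heavily simplified structures. The definitions below are hypothetical stand-ins that keep only the fields mentioned in the text; the real structures are larger and differ between Windows versions. The point of the sketch is the "first subsection sits right after the control area" layout and the NextSubsection walk.

```c
#include <assert.h>
#include <stddef.h>

struct control_area;

typedef struct subsection {
    struct control_area *ControlArea;
    void                *SubsectionBase;    /* first ProtoPTE of the subsection */
    struct subsection   *NextSubsection;
    unsigned             PtesInSubsection;
} subsection;

typedef struct control_area {
    void *Segment;
} control_area;

typedef struct {
    void *DataSectionObject;    /* really a control_area pointer */
    void *SharedCacheMap;
    void *ImageSectionObject;
} section_object_pointers;

typedef struct {
    section_object_pointers *SectionObjectPointer;
} file_object;

/* The first subsection is laid out immediately after the control area. */
subsection *first_subsection(control_area *ca)
{
    assert(ca != NULL);
    return (subsection *)((char *)ca + sizeof(*ca));
}

/* Walk FILE_OBJECT -> SectionObjectPointer -> CONTROL_AREA -> subsections
 * and add up the number of prototype PTEs that describe the data stream. */
unsigned count_data_ptes(const file_object *fo)
{
    if (fo->SectionObjectPointer == NULL ||
        fo->SectionObjectPointer->DataSectionObject == NULL)
        return 0;

    control_area *ca = fo->SectionObjectPointer->DataSectionObject;
    unsigned total = 0;
    for (subsection *s = first_subsection(ca); s != NULL; s = s->NextSubsection)
        total += s->PtesInSubsection;
    return total;
}
```

The same address computation appears in a debugger session as "control area address + sizeof(control area)".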

Below all these structures for a real file object are printed from WinDBG.

0: kd> ??FileObject
struct _FILE_OBJECT * 0x8750ef80
   +0x000 Type             : 0n5
   +0x002 Size             : 0n128
   +0x004 DeviceObject     : 0x879c9030 _DEVICE_OBJECT
   +0x008 Vpb              : 0x879d5888 _VPB
   +0x00c FsContext        : 0x87bdde68 Void
   +0x010 FsContext2       : 0x863cc188 Void
   +0x014 SectionObjectPointer : 0x87bddea8 _SECTION_OBJECT_POINTERS
   +0x018 PrivateCacheMap  : 0x869acf90 Void
   +0x01c FinalStatus      : 0n0
   +0x020 RelatedFileObject : (null) 
   +0x024 LockOperation    : 0 ''
   +0x025 DeletePending    : 0 ''
   +0x026 ReadAccess       : 0 ''
   +0x027 WriteAccess      : 0 ''
   +0x028 DeleteAccess     : 0 ''
   +0x029 SharedRead       : 0 ''
   +0x02a SharedWrite      : 0 ''
   +0x02b SharedDelete     : 0 ''
   +0x02c Flags            : 0xc0012
   +0x030 FileName         : _UNICODE_STRING "\Sample Pictures\Chrysanthemum.jpg"
   +0x038 CurrentByteOffset : _LARGE_INTEGER 0x11000
   +0x040 Waiters          : 0
   +0x044 Busy             : 1
   +0x048 LastLock         : (null) 
   +0x04c Lock             : _KEVENT
   +0x05c Event            : _KEVENT
   +0x06c CompletionContext : (null) 
   +0x070 IrpListLock      : 0
   +0x074 IrpList          : _LIST_ENTRY [ 0x8750eff4 - 0x8750eff4 ]
   +0x07c FileObjectExtension : 0x8774e950 Void

0: kd> ??FileObject->SectionObjectPointer
struct _SECTION_OBJECT_POINTERS * 0x87bddea8
   +0x000 DataSectionObject : 0x863c1758 Void
   +0x004 SharedCacheMap   : 0x869acea0 Void
   +0x008 ImageSectionObject : (null) 

0: kd> dt nt!_CONTROL_AREA 0x863c1758 
   +0x000 Segment          : 0xaeb311a8 _SEGMENT
   +0x004 DereferenceList  : _LIST_ENTRY [ 0x0 - 0x0 ]
   +0x00c NumberOfSectionReferences : 1
   +0x010 NumberOfPfnReferences : 0x40
   +0x014 NumberOfMappedViews : 1
   +0x018 NumberOfUserReferences : 0
   +0x01c u                : <unnamed-tag>
   +0x020 FlushInProgressCount : 0
   +0x024 FilePointer      : _EX_FAST_REF
   +0x028 ControlAreaLock  : 0n0
   +0x02c ModifiedWriteCount : 0
   +0x02c StartingFrame    : 0
   +0x030 WaitingForDeletion : (null) 
   +0x034 u2               : <unnamed-tag>
   +0x040 LockedPages      : 0n1
   +0x048 ViewList         : _LIST_ENTRY [ 0x86a3a898 - 0x86a3a898 ]

0: kd> ??sizeof(nt!_CONTROL_AREA)
unsigned int 0x50

0: kd> dt nt!_SUBSECTION 0x863c1758+0x50
   +0x000 ControlArea      : 0x863c1758 _CONTROL_AREA
   +0x004 SubsectionBase   : 0xa8e9b008 _MMPTE
   +0x008 NextSubsection   : (null) 
   +0x00c PtesInSubsection : 0x40
   +0x010 UnusedPtes       : 0
   +0x010 GlobalPerSessionHead : (null) 
   +0x014 u                : <unnamed-tag>
   +0x018 StartingSector   : 0
   +0x01c NumberOfFullSectors : 0x40

0: kd> dt nt!_SEGMENT  0xaeb311a8 
   +0x000 ControlArea      : 0x863c1758 _CONTROL_AREA
   +0x004 TotalNumberOfPtes : 0x40
   +0x008 SegmentFlags     : _SEGMENT_FLAGS
   +0x00c NumberOfCommittedPages : 0
   +0x010 SizeOfSegment    : 0x40000
   +0x018 ExtendInfo       : (null) 
   +0x018 BasedAddress     : (null) 
   +0x01c SegmentLock      : _EX_PUSH_LOCK
   +0x020 u1               : <unnamed-tag>
   +0x024 u2               : <unnamed-tag>
   +0x028 PrototypePte     : 0xa8fe97e8 _MMPTE
   +0x030 ThePtes          : [1] _MMPTE

1: kd> dt nt!_MMPTE .
   +0x000 u                :
      +0x000 Long             : Uint8B
      +0x000 VolatileLong     : Uint8B
      +0x000 HighLow          : _MMPTE_HIGHLOW
      +0x000 Flush            : _HARDWARE_PTE
      +0x000 Hard             : _MMPTE_HARDWARE
      +0x000 Proto            : _MMPTE_PROTOTYPE
      +0x000 Soft             : _MMPTE_SOFTWARE
      +0x000 TimeStamp        : _MMPTE_TIMESTAMP
      +0x000 Trans            : _MMPTE_TRANSITION
      +0x000 Subsect          : _MMPTE_SUBSECTION
      +0x000 List             : _MMPTE_LIST

1: kd> dt nt!_MMPTE_PROTOTYPE
   +0x000 Valid            : Pos 0, 1 Bit
   +0x000 Unused0          : Pos 1, 7 Bits
   +0x000 ReadOnly         : Pos 8, 1 Bit
   +0x000 Unused1          : Pos 9, 1 Bit
   +0x000 Prototype        : Pos 10, 1 Bit
   +0x000 Protection       : Pos 11, 5 Bits
   +0x000 Unused           : Pos 16, 16 Bits
   +0x000 ProtoAddress     : Pos 32, 32 Bits
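
The bit positions printed above are enough to decode a raw prototype-format PTE value by hand. The helper below is a small sketch that follows exactly the layout in this dump (64-bit PTE, ProtoAddress in the upper 32 bits); other Windows versions and architectures use different layouts, and the names proto_pte_fields and decode_proto_pte are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned valid;          /* Pos 0,  1 bit   */
    unsigned read_only;      /* Pos 8,  1 bit   */
    unsigned prototype;      /* Pos 10, 1 bit   */
    unsigned protection;     /* Pos 11, 5 bits  */
    uint32_t proto_address;  /* Pos 32, 32 bits */
} proto_pte_fields;

/* Extract len bits starting at bit position pos. */
static uint64_t bits(uint64_t value, unsigned pos, unsigned len)
{
    assert(len > 0 && len < 64 && pos + len <= 64);
    return (value >> pos) & ((1ULL << len) - 1);
}

proto_pte_fields decode_proto_pte(uint64_t raw)
{
    proto_pte_fields f;
    f.valid         = (unsigned)bits(raw, 0, 1);
    f.read_only     = (unsigned)bits(raw, 8, 1);
    f.prototype     = (unsigned)bits(raw, 10, 1);
    f.protection    = (unsigned)bits(raw, 11, 5);
    f.proto_address = (uint32_t)bits(raw, 32, 32);
    return f;
}
```

An invalid PTE with the Prototype bit set carries, in ProtoAddress, the address of the prototype PTE that the page fault handler follows.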


Tuesday, February 3, 2015

Oooops in Windows 10 filesystem filter

Windows 10 might be naughty to old-style FSD filters, or to filters registered to be called after Windows' WdFilter FS minifilter. WdFilter's create post-callback sends a scan message (allegedly) to user mode, which delays completion of the create IRP. In many cases the underlying FSD has already used this file object to initialize the cache. As a result, a file system filter driver can observe I/O for a file object whose IRP_MJ_CREATE request is still lingering, i.e. not all completion routines and FS minifilter post-callbacks have been called yet. For example, the first IRP_MJ_CREATE completion is waiting in WdFilter!MpScanFile while an application opens the file for writing, and this second create request completes promptly. The application then issues a write request, which first writes the data to the cache and is then followed by a synchronous cache flush that uses the file object whose IRP_MJ_CREATE is still waiting in WdFilter!MpScanFile.

Below are two call stacks for this scenario.

A create request issued by an FSD filter:

00 ffffd000`2060fd90 fffff804`0082020e nt!KiSwapContext+0x76
01 ffffd000`2060fed0 fffff804`0081f6ff nt!KiSwapThread+0x66e
02 ffffd000`2060ff90 fffff804`0081ee44 nt!KiCommitThreadWait+0x12f
03 ffffd000`20610010 fffff804`00c88ef1 nt!KeWaitForMultipleObjects+0x424
04 ffffd000`206100d0 fffff801`89a08d57 nt!FsRtlCancellableWaitForMultipleObjects+0x91
05 ffffd000`20610140 fffff801`8aa798fb FLTMGR!FltSendMessage+0x497
06 ffffd000`20610280 fffff801`8aa751f3 WdFilter!MpScanFile+0x5cb
07 ffffd000`20610400 fffff801`8aa747ea WdFilter!MpAmPostCreate+0x8a3
08 ffffd000`20610600 fffff801`89a03633 WdFilter!MpPostCreate+0x17a
09 ffffd000`206106b0 fffff801`89a03033 FLTMGR!FltpPerformPostCallbacks+0x2f3
0a ffffd000`20610780 fffff801`89a049f6 FLTMGR!FltpPassThroughCompletionWorker+0x73
0b ffffd000`206107c0 fffff801`89a314eb FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x1d6
0c ffffd000`20610830 fffff801`8d067232 FLTMGR!FltpCreate+0x33b
................................................................... Here was a pass through filter, removed because of NDA
15 ffffd000`206113c0 fffff804`00814ef2 nt!IovCallDriver+0x3d8
16 ffffd000`20611420 fffff804`00bfbf0e nt!IofCallDriver+0x72
17 ffffd000`20611460 fffff804`00c0313d nt!IopParseDevice+0x6ae
18 ffffd000`20611670 fffff804`00c0112c nt!ObpLookupObjectName+0x6ed
19 ffffd000`206117e0 fffff804`00be4029 nt!ObOpenObjectByName+0x1ec
1a ffffd000`20611910 fffff804`00be3ba0 nt!IopCreateFile+0x369
1b ffffd000`206119b0 fffff804`00d07b35 nt!IoCreateFileEx+0x100
1c ffffd000`20611a40 fffff801`8d07775a nt!IoCreateFileSpecifyDeviceObjectHint+0xe5
................................................................... Here was a pass through filter, removed because of NDA
1f ffffd000`20612780 fffff804`00ef0e6c VerifierExt!xdv_IRP_MJ_DEVICE_CONTROL_wrapper+0xff
20 ffffd000`206127e0 fffff804`00814ef2 nt!IovCallDriver+0x3d8
21 ffffd000`20612840 fffff804`00c0a587 nt!IofCallDriver+0x72
22 ffffd000`20612880 fffff804`00c90186 nt!IopXxxControlFile+0x8f7
23 ffffd000`20612a20 fffff804`00940f63 nt!NtDeviceIoControlFile+0x56
24 ffffd000`20612a90 00007ffe`803bda6a nt!KiSystemServiceCopyEnd+0x13

25 00000000`047bea08 00000000`00000000 ntdll!NtDeviceIoControlFile+0xa

A write request that uses the file object still in the IRP_MJ_CREATE completion phase. This file object is used to flush the cache because the FSD used it to back the mapped file segment object on which the file system cache is built:

15 ffffd000`227c6280 fffff804`00814ef2 nt!IovCallDriver+0x3d8
16 ffffd000`227c62e0 fffff804`008bbfec nt!IofCallDriver+0x72
17 ffffd000`227c6320 fffff804`008bbe76 nt!IoSynchronousPageWriteEx+0x140
18 ffffd000`227c6360 fffff804`0084ca5a nt!MiIssueSynchronousFlush+0x6a
19 ffffd000`227c63d0 fffff804`0084ef08 nt!MiFlushSectionInternal+0x96a
1a ffffd000`227c65f0 fffff804`0084fadc nt!MmFlushSection+0x108
1b ffffd000`227c66c0 fffff804`00833982 nt!CcFlushCachePriv+0x57c
1c ffffd000`227c67c0 fffff804`0087bebe nt!CcMapAndCopyInToCache+0x622
1d ffffd000`227c68b0 fffff801`8d300f32 nt!CcCopyWriteEx+0x1de
1e ffffd000`227c6950 fffff801`8d2c772d fastfat!FatCommonWrite+0x194a
1f ffffd000`227c6b90 fffff804`00ef0e6c fastfat!FatFsdWrite+0xed
20 ffffd000`227c6bd0 fffff804`00814ef2 nt!IovCallDriver+0x3d8
21 ffffd000`227c6c30 fffff801`89a04974 nt!IofCallDriver+0x72
22 ffffd000`227c6c70 fffff801`89a02972 FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x154
23 ffffd000`227c6ce0 fffff801`8d067232 FLTMGR!FltpDispatch+0xb2
................................................................... Here was a pass through filter, removed because of NDA
2c ffffd000`227c7820 fffff804`00814ef2 nt!IovCallDriver+0x3d8
2d ffffd000`227c7880 fffff804`00c70d62 nt!IofCallDriver+0x72
2e ffffd000`227c78c0 fffff804`00c71f37 nt!IopSynchronousServiceTail+0x162
2f ffffd000`227c7990 fffff804`00940f63 nt!NtWriteFile+0x687
30 ffffd000`227c7a90 00007ffe`803bda7a nt!KiSystemServiceCopyEnd+0x13
31 00000000`076fc8a8 00007ffe`7d7617cc ntdll!NtWriteFile+0xa
32 00000000`076fc8b0 00007ffe`7d762657 KERNELBASE!WriteFile+0x10c
33 00000000`076fc930 00007ffe`7d760524 KERNELBASE!BaseCopyStream+0x953
34 00000000`076fda90 00007ffe`7d7cfd09 KERNELBASE!BasepCopyFileExW+0x774
35 00000000`076fe070 00007ffe`7e45bb74 KERNELBASE!CopyFile2+0xf9
36 00000000`076fe180 00007ffe`7e45555b SHELL32!CFSTransfer::_PerformCopyFileWithRetry+0xd0
37 00000000`076fe230 00007ffe`7e1c6cbd SHELL32!CFSTransfer::CopyItem+0x23b
38 00000000`076fe2a0 00007ffe`7e1b9c16 SHELL32!CCopyOperation::_CreateDestinationOrCopyItemWithRetry+0xc1
39 00000000`076fe360 00007ffe`7dededf0 SHELL32!CCopyOperation::Do+0x126
3a 00000000`076fe660 00007ffe`7dedeaaf SHELL32!CCopyWorkItem::_DoOperation+0x50
3b 00000000`076fe6d0 00007ffe`7dede3b8 SHELL32!CCopyWorkItem::_SetupAndPerformOp+0x273
3c 00000000`076fea10 00007ffe`7dede70e SHELL32!CCopyWorkItem::ProcessWorkItem+0x194
3d 00000000`076fecd0 00007ffe`7dede479 SHELL32!CCopyWorkItem::_ProcessChildren+0xf6
3e 00000000`076fed50 00007ffe`7dee0f7e SHELL32!CCopyWorkItem::ProcessWorkItem+0x255
3f 00000000`076ff010 00007ffe`7dee2dbd SHELL32!CRecursiveFolderOperation::Do+0x1ce
40 00000000`076ff0c0 00007ffe`7dee27cf SHELL32!CFileOperation::_EnumRootDo+0x2c9
41 00000000`076ff170 00007ffe`7dee1c3f SHELL32!CFileOperation::PrepareAndDoOperations+0x1c7
42 00000000`076ff250 00007ffe`7e46c339 SHELL32!CFileOperation::PerformOperations+0xcf
43 00000000`076ff2b0 00007ffe`7e46a79b SHELL32!CFSDropTargetHelper::_MoveCopyHIDA+0x271
44 00000000`076ff370 00007ffe`7e46d10e SHELL32!CFSDropTargetHelper::_Drop+0x333
45 00000000`076ff650 00007ffe`7d65a737 SHELL32!CFSDropTargetHelper::s_DoDropThreadProc+0x3e
46 00000000`076ff680 00007ffe`7fde5f72 SHCORE!GetDpiForMonitor+0x157
47 00000000`076ff770 00007ffe`80389b54 KERNEL32!BaseThreadInitThunk+0x22
48 00000000`076ff7a0 00000000`00000000 ntdll!RtlUserThreadStart+0x34

The file object

5: kd> dt nt!_FILE_OBJECT ffffe0015b432800
   +0x000 Type             : 0n5
   +0x002 Size             : 0n216
   +0x008 DeviceObject     : 0xffffe001`5b027c00 _DEVICE_OBJECT
   +0x010 Vpb              : 0xffffe001`62c9c290 _VPB
   +0x018 FsContext        : 0xffffc001`cba39da0 Void
   +0x020 FsContext2       : 0xffffc001`cc292340 Void
   +0x028 SectionObjectPointer : 0xffffe001`59d656f0 _SECTION_OBJECT_POINTERS
   +0x030 PrivateCacheMap  : (null) 
   +0x038 FinalStatus      : 0n0
   +0x040 RelatedFileObject : (null) 
   +0x048 LockOperation    : 0 ''
   +0x049 DeletePending    : 0 ''
   +0x04a ReadAccess       : 0x1 ''
   +0x04b WriteAccess      : 0 ''
   +0x04c DeleteAccess     : 0 ''
   +0x04d SharedRead       : 0x1 ''
   +0x04e SharedWrite      : 0x1 ''
   +0x04f SharedDelete     : 0x1 ''
   +0x050 Flags            : 0x42
   +0x058 FileName         : _UNICODE_STRING "\TestPicture\bootmgr"
   +0x068 CurrentByteOffset : _LARGE_INTEGER 0x0
   +0x070 Waiters          : 0
   +0x074 Busy             : 0
   +0x078 LastLock         : (null) 
   +0x080 Lock             : _KEVENT
   +0x098 Event            : _KEVENT
   +0x0b0 CompletionContext : (null) 
   +0x0b8 IrpListLock      : 0
   +0x0c0 IrpList          : _LIST_ENTRY [ 0xffffe001`5b4328c0 - 0xffffe001`5b4328c0 ]

   +0x0d0 FileObjectExtension : 0xffffcf81`75df8fb0 Void

The create IRP

5: kd> !irp 0xffffcf81`75fb8a20 1
Irp is active with 17 stacks 17 is current (= 0xffffcf8175fb8f70)
 No Mdl: No System Buffer: Thread ffffe0015941d080:  Irp stack trace.  
Flags = 40000884
ThreadListEntry.Flink = ffffcf8176c30ec0
ThreadListEntry.Blink = ffffe0015941d6e0
IoStatus.Status = 00000000
IoStatus.Information = 00000001
RequestorMode = 00000000
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = ffffd00020611560
UserEvent = 00000000
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00000000
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = 00000000   
UserBuffer = 00000000
&Tail.Overlay.DeviceQueueEntry = ffffcf8175fb8a98
Tail.Overlay.Thread = ffffe0015941d080
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = 00000000
Tail.Overlay.ListEntry.Blink = 00000000
Tail.Overlay.CurrentStackLocation = ffffcf8175fb8f70
Tail.Overlay.OriginalFileObject = ffffe0015b432800
Tail.Apc = 00000000
Tail.CompletionKey = 00000000
     cmd  flg cl Device   File     Completion-Context

...................................
Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0 10 ffffe00159976970 00000000 fffff80189a09220-ffffe001595856b0    
      \FileSystem\fastfat FLTMGR!FltpSynchronizedOperationCompletion
Args: 00000000 00000000 00000000 00000000
>[  0, 0]   0 e0 ffffe001599b8060 ffffe0015b432800 fffff8018d0498b0-ffffcf81765d0f80 Success Error Cancel 
      \FileSystem\FltMgr DeviceLockDriver0!DlFoHashCreateRequestHookCompletion

Args: ffffd000206115f0 01000120 00070080 00000000

The write IRP with the FILE_OBJECT at ffffe0015b432800, whose IRP_MJ_CREATE IRP is still waiting in a minifilter post-create callback

5: kd> !irp 0xffffcf81`76a62a20 0x1
Irp is active with 17 stacks 17 is current (= 0xffffcf8176a62f70)
 Mdl=ffffe0016312e280: No System Buffer: Thread ffffe00163211080:  Irp stack trace.  
Flags = 40060043
ThreadListEntry.Flink = ffffcf8176562a40
ThreadListEntry.Blink = ffffe001632116e0
IoStatus.Status = 00000000
IoStatus.Information = 00000000
RequestorMode = 00000000
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = ffffd000227c6850
UserEvent = ffffd000227c6390
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00000000
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = 00000000   
UserBuffer = 00000000
&Tail.Overlay.DeviceQueueEntry = ffffcf8176a62a98
Tail.Overlay.Thread = ffffe00163211080
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = 00000000
Tail.Overlay.ListEntry.Blink = 00000000
Tail.Overlay.CurrentStackLocation = ffffcf8176a62f70
Tail.Overlay.OriginalFileObject = ffffe0015b432800
Tail.Apc = 00000000
Tail.CompletionKey = 00000000
     cmd  flg cl Device   File     Completion-Context

..........................

>[  4, 0]   4  0 ffffe001599b8060 ffffe0015b432800 00000000-00000000    
      \FileSystem\FltMgr
Args: 00020000 00000000 00000000 00000000



Monday, August 4, 2014

User space access on Windows, Mac OS X and Linux.

I made these notes just for the record, to have everything in one place.

Windows

All access to user mode space must be done by code protected by SEH; for more information see "Structured Exception Handling" on MSDN.

__try
{
   AccessUserModeSpace();
}
__except( ExceptionFilter() )
{
    ExceptionHandler();
}

Below is a discussion of how SEH is implemented internally by the compiler and the kernel.

Windows 32 bit

A plethora of information is available online, e.g. "A Crash Course on the Depths of Win32 Structured Exception Handling". The basic idea is to place SEH frame registrations, each holding filter and handler addresses, on the stack and link them into a list whose head is at fs:PcExceptionList. Here fs is an IA-32 segment register that addresses per-thread information, i.e. it is reloaded on each thread switch. When a memory access raises an exception, the kernel exception handler calls RtlDispatchException, which knows where to find the list of registered exception filters and handlers.

Windows 64 bit

When the compiler meets a __try/__except construct it adds an entry to the exception table that contains the start and end addresses of the protected region and the offsets of the filter and the handler. This is similar to the method used on Linux. The details can be found in "Exceptional Behavior - x64 Structured Exception Handling"; below I describe what is not mentioned in that article.
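To make the table idea concrete, here is a minimal portable sketch of the lookup. The scope_entry type and find_scope are hypothetical names invented for illustration, not the real structures, and the "innermost region listed first" ordering is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record for one __try/__except region.
 * All values are offsets from the image base. */
typedef struct {
    uint32_t begin;    /* first byte of the protected region         */
    uint32_t end;      /* one past the last byte of the region       */
    uint32_t filter;   /* offset of the filter expression            */
    uint32_t handler;  /* offset of the handler block                */
} scope_entry;

/* Return the first record that covers fault_rva, or NULL when the
 * faulting instruction is outside every protected region. With nested
 * regions listed innermost first (an assumption of this sketch), the
 * first match is the innermost one. */
const scope_entry *
find_scope(const scope_entry *table, size_t count, uint32_t fault_rva)
{
    assert(table != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (fault_rva >= table[i].begin && fault_rva < table[i].end)
            return &table[i];
    }
    return NULL;
}
```

Nothing is registered at run time in this scheme: entering a protected region costs nothing, and the table is consulted only after a fault has happened.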

The linker adds an entry for every function. This entry contains the information used by RtlVirtualUnwind, which is called by RtlDispatchException. This makes it possible to call any function inside a __try block: the kernel is still able to find its way up the call stack to the place where an exception filter and handler are registered. A 32-bit kernel does this by looking at fs:PcExceptionList, but the kernel has no such luxury in 64-bit mode.

You can look at these entries with the .fnent command.

For example, here is the entry for memcpy, which is used to copy memory to and from user space. This function does not register any exception handlers, but the kernel must be able to unwind the call stack through it to find one, so only unwind information is present:

 3: kd> .fnent nt!memcpy
Debugger function entry 00000000`003da6e8 for:
(fffff800`9046b340)   nt!memcpy   |  (fffff800`904d7c50)   nt! ?? ::FNODOBFM::`string'
Exact matches:
    nt!memcpy (<no parameter info>)
    nt!memmove (<no parameter info>)

BeginAddress      = 00000000`00055340
EndAddress        = 00000000`00055679
UnwindInfoAddress = 00000000`00201380

Unwind info at fffff800`90617380, 4 bytes
  version 1, flags 0, prolog 0, codes 0
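The BeginAddress / EndAddress / UnwindInfoAddress triple printed above is all that is needed to get from a faulting address to the unwind data of the function containing it. Here is a minimal sketch of such a lookup. The function_entry type and lookup_function_entry are hypothetical names, and using a binary search over a table sorted by begin address is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The same three fields that .fnent prints, as image-relative offsets. */
typedef struct {
    uint32_t begin_address;
    uint32_t end_address;          /* one past the last byte of the function */
    uint32_t unwind_info_address;
} function_entry;

/* Binary search over entries sorted by begin_address and not overlapping.
 * Returns the entry covering rva, or NULL if rva falls in a gap. */
const function_entry *
lookup_function_entry(const function_entry *table, size_t count, uint32_t rva)
{
    size_t lo = 0;
    size_t hi = count;

    assert(table != NULL || count == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (rva < table[mid].begin_address)
            hi = mid;                 /* continue in the lower half */
        else if (rva >= table[mid].end_address)
            lo = mid + 1;             /* continue in the upper half */
        else
            return &table[mid];       /* begin <= rva < end         */
    }
    return NULL;
}
```

With a table that contains the memcpy entry shown above (0x55340, 0x55679, 0x201380), any offset inside that range resolves to the unwind info at 0x201380, which is how an unwinder could step through memcpy even though memcpy itself has no handler.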


Now look at a bigger function, MmLockAndCopyMemory. As you can see, the unwind information is pretty extensive, and there is still no exception filter or handler:

2: kd> .fnent nt!MmLockAndCopyMemory
Debugger function entry 00000000`003da6e8 for:
(fffff800`9098f920)   nt!MmLockAndCopyMemory   |  (fffff800`9098fb30)   nt!MmStoreRegister
Exact matches:
    nt!MmLockAndCopyMemory (<no parameter info>)

BeginAddress      = 00000000`00579920
EndAddress        = 00000000`00579b30
UnwindInfoAddress = 00000000`0025406c

Unwind info at fffff800`9066a06c, 20 bytes
  version 2, flags 0, prolog 1c, codes e
  00: offs a, unwind op 6, op info 0 UWOP_EPILOG Length: a. Flags: 0
  01: offs 10, unwind op 6, op info 0 UWOP_EPILOG Offset from end: 10 (FFFFF8009098FB20)
  02: offs 1c, unwind op 4, op info 6 UWOP_SAVE_NONVOL FrameOffset: 70 reg: rsi.
  04: offs 1c, unwind op 4, op info 5 UWOP_SAVE_NONVOL FrameOffset: 68 reg: rbp.
  06: offs 1c, unwind op 4, op info 3 UWOP_SAVE_NONVOL FrameOffset: 60 reg: rbx.
  08: offs 1c, unwind op 2, op info 5 UWOP_ALLOC_SMALL.
  09: offs 18, unwind op 0, op info f UWOP_PUSH_NONVOL reg: r15.
  0a: offs 16, unwind op 0, op info e UWOP_PUSH_NONVOL reg: r14.
  0b: offs 14, unwind op 0, op info d UWOP_PUSH_NONVOL reg: r13.
  0c: offs 12, unwind op 0, op info c UWOP_PUSH_NONVOL reg: r12.
  0d: offs 10, unwind op 0, op info 7 UWOP_PUSH_NONVOL reg: rdi.

Now look at a function that definitely contains code that accesses user space, e.g. NtReadFile. Here you can see _C_specific_handler as the handler routine, which takes care of calling the filter and the handler:

3: kd> .fnent nt!NtReadFile
Debugger function entry 00000000`003da6e8 for:
(fffff800`908b6690)   nt!NtReadFile   |  (fffff800`908b6f50)   nt!FsRtlCancellableWaitForMultipleObjects
Exact matches:
    nt!NtReadFile (<no parameter info>)

BeginAddress      = 00000000`004a0690
EndAddress        = 00000000`004a0f50
UnwindInfoAddress = 00000000`00242114

Unwind info at fffff800`90658114, 24 bytes
  version 2, flags 1, prolog 21, codes b
  handler routine: nt!_C_specific_handler (fffff800`905039a0), data 4
  00: offs c, unwind op 6, op info 0 UWOP_EPILOG Length: c. Flags: 0
  01: offs b6, unwind op 6, op info 3 UWOP_EPILOG Offset from end: 3b6 (FFFFF800908B6B9A)
  02: offs 21, unwind op 1, op info 0 UWOP_ALLOC_LARGE FrameOffset: c0.
  04: offs 1a, unwind op 0, op info f UWOP_PUSH_NONVOL reg: r15.
  05: offs 18, unwind op 0, op info e UWOP_PUSH_NONVOL reg: r14.
  06: offs 16, unwind op 0, op info d UWOP_PUSH_NONVOL reg: r13.
  07: offs 14, unwind op 0, op info c UWOP_PUSH_NONVOL reg: r12.
  08: offs 12, unwind op 0, op info 7 UWOP_PUSH_NONVOL reg: rdi.
  09: offs 11, unwind op 0, op info 6 UWOP_PUSH_NONVOL reg: rsi.
  0a: offs 10, unwind op 0, op info 3 UWOP_PUSH_NONVOL reg: rbx.


To be continued (Mac OS X and Linux)...

Saturday, March 29, 2014

A promiscuous MmProbeAndLockPages

Consider a scenario in which a Windows driver sweeping through a process address space somehow gets a pointer to a valid address range in the system address space and wants it to be accessible at an IRQL greater than or equal to DISPATCH_LEVEL, i.e. when the scheduler is not available and swapped-out pages can't be retrieved from the backing store. The solution is to lock the pages by calling MmProbeAndLockPages. Is this a bulletproof solution? The answer is NO. The driver will cause intermittent system crashes with a stack like the one shown below:

nt!KeBugCheckEx
nt!MiBadRefCount
nt!MiFreePoolPages
nt!ExFreePoolWithTag
<.....>

The reason is that the system pool returns a page to the list of free pages, and the system does not expect that page to be locked. This happens when the last allocation on a page has been released, so the page no longer contains valid allocations and can be returned to the system's list of free pages. The system assumes that all pool allocations that have been locked are unlocked by calling MmUnlockPages before they are freed by calling ExFreePool.
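The ordering rule can be shown with a toy user-mode model. Everything in it (pool_page, lock_page, unlock_page, free_allocation and the result codes) is hypothetical and only captures the two facts above: a page is handed back when its last allocation is freed, and at that moment it must not be locked.

```c
#include <assert.h>

/* Toy model of one pool page; not the real memory manager bookkeeping. */
typedef struct {
    unsigned live_allocations;  /* pool blocks still allocated on the page */
    unsigned lock_count;        /* outstanding locks taken on the page     */
} pool_page;

typedef enum {
    FREE_OK,             /* other allocations remain, page stays in the pool */
    FREE_PAGE_RELEASED,  /* last allocation freed, page went to the free list */
    FREE_BAD_REF_COUNT   /* last allocation freed while the page was locked  */
} free_result;

void lock_page(pool_page *page)
{
    page->lock_count++;
}

void unlock_page(pool_page *page)
{
    assert(page->lock_count > 0);
    page->lock_count--;
}

/* Free one allocation. A leftover lock is noticed only when the page
 * becomes empty, which is why the crash looks intermittent. */
free_result free_allocation(pool_page *page)
{
    assert(page->live_allocations > 0);
    page->live_allocations--;

    if (page->live_allocations > 0)
        return FREE_OK;

    return page->lock_count == 0 ? FREE_PAGE_RELEASED : FREE_BAD_REF_COUNT;
}
```

In this model a stray lock stays unnoticed while other allocations keep the page alive; the failure shows up only on the free that empties the page, and that free is usually issued by some unrelated owner of the memory.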

Monday, March 24, 2014

Windows Object Manager, Paged Pool and elevated IRQL

Surprisingly, the Windows 8 Object Manager allocates some objects from the paged pool. That means that ObReferenceObject and ObDereferenceObject can't be safely called at DISPATCH_LEVEL: the actual maximum IRQL becomes APC_LEVEL if an object is allocated from the paged pool. For example, a token object might come from the paged pool, as the !pool command shows:

1: kd> !pool ffffc00002b73770
Pool page ffffc00002b73770 region is Paged pool
.....
*ffffc00002b73740 size:  8c0 previous size:  1c0  (Allocated) *Toke
Pooltag Toke : Token objects, Binary : nt!se

The object itself (a pretty large pointer count, but nevertheless a valid object):

1: kd> !object ffffc00002b737a0
Object: ffffc00002b737a0  Type: (ffffe00000153db0) Token
    ObjectHeader: ffffc00002b73770 (new version)
    HandleCount: 33  PointerCount: 131067

Driver Verifier was active and cleared the valid bit from a PTE mapping the paged pool's page on which the object was allocated

1: kd> !pte ffffc00002b737a0
                                           VA ffffc00002b737a0
PXE at FFFFF6FB7DBEDC00    PPE at FFFFF6FB7DB80000    PDE at FFFFF6FB700000A8    PTE at FFFFF6E000015B98
contains 000000000134F863  contains 0000000001DCE863  contains 00000001257C2863  contains FB40000129FE9882
pfn 134f      ---DA--KWEV  pfn 1dce      ---DA--KWEV  pfn 1257c2    ---DA--KWEV  not valid
                                                                                  Transition: 129fe9
                                                                                  Protect: 4 - ReadWrite

The PTE was marked as invalid even though the physical page actually contains valid data and has not been reused or swapped out. The valid bit will be restored by the page fault handler when it processes a page fault (this is called a soft page fault, as there is no I/O from the backing store), but calling ObDereferenceObject on this object at DISPATCH_LEVEL crashes the system:

TRAP_FRAME:  ffffd000201fc800 -- (.trap 0xffffd000201fc800)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000005 rbx=0000000000000000 rcx=ffffc00002b737a0
rdx=0000000000000005 rsi=0000000000000000 rdi=0000000000000000
rip=fffff803b20565a3 rsp=ffffd000201fc990 rbp=fffff800017bf594
 r8=0000000000000007  r9=fffff800017debac r10=0000000000000000
r11=ffffd000201fcc70 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po nc
nt!ObfDereferenceObject+0x23:
fffff803`b20565a3 f0480fc15ed0    lock xadd qword ptr [rsi-30h],rbx ds:ffffffff`ffffffd0=????????????????
Resetting default scope

LAST_CONTROL_TRANSFER:  from fffff803b21f10ea to fffff803b216f890

STACK_TEXT:  
 nt!DbgBreakPointWithStatus
nt!KiBugCheckDebugBreak+0x12
nt!KeBugCheck2+0x8ab
nt!KeBugCheckEx+0x104
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x23a
nt!ObfDereferenceObject+0x23
<here is an offending driver ))))>

Thursday, February 27, 2014

What is in a name? ( of a process )

What does PsGetCurrentProcess return?

The answer: it returns the thread's process. But do you know that a thread might have TWO processes? The first one is the parent process that created the thread, and the other one is the process to which the thread has been attached by KeStackAttachProcess. Which one does PsGetCurrentProcess return? It returns the attached process if it is not NULL, or the parent process otherwise.

So this raises a question: how do you get the parent process? The answer is IoThreadToProcess.
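The rule fits in a few lines of portable C. This is only a model of the behavior described above, not the kernel's real data layout; the process and thread types and all four function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct process {
    const char *name;
} process;

typedef struct thread {
    process *owner;     /* the "parent" process from the text above        */
    process *attached;  /* non-NULL only while attached to another process */
} thread;

/* Models PsGetCurrentProcess: the attached process wins if there is one. */
process *current_process(const thread *t)
{
    return t->attached != NULL ? t->attached : t->owner;
}

/* Models IoThreadToProcess: always the parent process. */
process *thread_to_process(const thread *t)
{
    return t->owner;
}

/* Models KeStackAttachProcess; nesting is left out of this sketch. */
void attach_process(thread *t, process *target)
{
    assert(t->attached == NULL);
    t->attached = target;
}

/* Models the matching detach call. */
void detach_process(thread *t)
{
    assert(t->attached != NULL);
    t->attached = NULL;
}
```

While attached, current_process and thread_to_process give different answers for the same thread, which is exactly the case where picking the wrong one makes a driver attribute an operation to the wrong process.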

The other question: what does "attach to a process" mean? It means that the thread operates in the address space of the attached process (i.e. PDE and CR3 are changed). Any function that operates on the user-mode part of the address space will therefore change or fetch data in the attached process. The notion of an "attached process" is meaningful only while a thread is executing in kernel mode; since the system space is almost completely shared between all processes, changing the page tables does not have a serious impact on access to the system space.

The notion of attaching is much more profound in 32-bit Mac OS X or iOS, where every process has access to the full 4 GB virtual address space and there is no division into system and user space. When a thread switches to kernel mode the CR3 register is reloaded, so the 32-bit Mac OS X kernel cannot access user space through a pointer; to access user space the kernel (or a kernel module) calls functions that do so by switching CR3. In 64-bit Mac OS X or iOS the process address space is divided into user space and kernel space, and access by pointer becomes possible, though it is discouraged by Apple and will crash the system in debug mode, where CR3 is reloaded when a thread enters kernel mode.

Wednesday, February 26, 2014

Outswapped kernel stack

You definitely know that a kernel stack can be outswapped if certain conditions are met. One such condition is waiting with WaitMode set to UserMode.

If an event is allocated on a kernel stack, that stack can be swapped out when a driver does something like this:

GetOperationCompletionStatus( ... )
{
    KEVENT    Event;       // lives on the kernel stack, this is the problem
    KIRQL     OldIrql;
    NTSTATUS  WaitStatus;
    BOOLEAN   Wait = FALSE;

    KeInitializeEvent( &Event, SynchronizationEvent, FALSE );

    KeAcquireSpinLock( &Lock, &OldIrql );
    {
       if( FALSE == Operation->Completed  ){
          Operation->CompletionEvent = &Event;
          Wait = TRUE;
       }
    }
    KeReleaseSpinLock( &Lock, OldIrql );

    while( Wait ){

       // allow a user to wake up the thread when terminating the
       // process, but note that the stack might be outswapped
       // while the thread is blocked waiting for the event
       WaitStatus = KeWaitForSingleObject( &Event,
                                           Executive,
                                           UserMode,
                                           FALSE,
                                           NULL );

       if( STATUS_SUCCESS == WaitStatus )
       {
          // the event was set by NotifyOfCompletion
          Wait = FALSE;
       }
       else
       {
          KeAcquireSpinLock( &Lock, &OldIrql );
          {
             // if NULL then go back to waiting as
             // there is an ongoing completion
             if( Operation->CompletionEvent ){
                Operation->CompletionEvent = NULL;
                Wait = FALSE;
             }
          }
          KeReleaseSpinLock( &Lock, OldIrql );
       }
    }
}


NotifyOfCompletion()
{
    KIRQL    OldIrql;

    KeAcquireSpinLock( &Lock, &OldIrql );
    {
        Operation->Completed = TRUE;
        if( Operation->CompletionEvent  ){

           // the following call sometimes crashes the system
           // when it tries to access an outswapped page
           KeSetEvent( Operation->CompletionEvent,
                       IO_NO_INCREMENT,
                       FALSE );
           Operation->CompletionEvent = NULL;
        }
    }
    KeReleaseSpinLock( &Lock, OldIrql );
}

The reason to do this is to be slightly more gentle and allow a user to terminate a waiting thread. This is a common scenario for distributed file systems, where the response time might be up to minutes.

The problem with the above code is that the kernel stack can be swapped out while the thread waits in KeWaitForSingleObject with the wait mode set to UserMode. The call to KeSetEvent then tries to access the event on the outswapped stack while the IRQL is DISPATCH_LEVEL. This has nothing to do with the call to KeAcquireSpinLock: the same happens even if you call KeSetEvent without raising the IRQL, as KeSetEvent itself elevates the IRQL when working with the event.

If you check the event address on an outswapped stack with WinDbg you will see:

1: kd> !pte 0xaf792cac
                    VA af792cac
PDE at C0602BD8            PTE at C057BC90
contains 000000005402F863  contains 00000000A5129BE2
pfn 5402f     ---DA--KWEV  not valid
                            Transition: a5129

                            Protect: 1f - Outswapped kernel stack

The solution to the above example is to allocate the event from the NonPaged pool.
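As a side note, the raw "contains" values that !pte prints for invalid pages can be decoded by hand. The sketch below is a portable decoder; the bit positions in it are assumptions that I only checked against the !pte dumps in these posts (0xA5129BE2 gives Transition a5129 and Protect 1f, 0xFB40000129FE9882 gives Transition 129fe9 and Protect 4), and the soft_pte type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a software (not valid) PTE; positions are assumptions. */
typedef struct {
    unsigned valid;       /* bit 0                                  */
    unsigned protection;  /* bits 5..9                              */
    unsigned prototype;   /* bit 10                                 */
    unsigned transition;  /* bit 11                                 */
    uint64_t pfn;         /* bits 12..47, meaningful in transition  */
} soft_pte;

soft_pte decode_pte(uint64_t raw)
{
    soft_pte p;

    p.valid      = (unsigned)(raw & 1);
    p.protection = (unsigned)((raw >> 5) & 0x1f);
    p.prototype  = (unsigned)((raw >> 10) & 1);
    p.transition = (unsigned)((raw >> 11) & 1);
    p.pfn        = (raw >> 12) & 0xFFFFFFFFFULL;
    return p;
}

/* Not valid, not prototype, transition bit set: the data is still in a
 * physical page, so a page fault can be resolved without any I/O. */
int is_transition_pte(uint64_t raw)
{
    soft_pte p = decode_pte(raw);

    return !p.valid && !p.prototype && p.transition;
}

/* Page frame number of a transition PTE. */
uint64_t transition_pfn(uint64_t raw)
{
    assert(is_transition_pte(raw));
    return decode_pte(raw).pfn;
}
```

Even a fault on such a transition page, the cheapest kind there is, cannot be serviced at DISPATCH_LEVEL, so the decoded value explains the crash but does not make the access any safer.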