Debugging Black Screen Errors on Windows* Using Intel® Debug Extensions for WinDbg*

ID 689757
Updated 1/22/2020
Version Latest
Public

Introduction

Windows* black screen hangs and crashes are difficult to debug because the system doesn't display any status or debug information, and the regular WinDbg* connection methods are frequently unusable. Intel® Debug Extensions for WinDbg*, included with Intel® System Debugger, can help with the debug process by providing a debug connection method to an otherwise unresponsive Windows* target.

This article shows how to use Intel® Debug Extensions for WinDbg* to analyze black screen hangs and crashes. It is assumed that you are familiar with Intel® DCI debug, and that you have installed Intel® System Debugger and WinDbg* on your host system, enabled DCI on your target system, and connected your host system to your target using a supported DCI method, such as an Intel® SVT DCI DbC cable or an Intel® SVT Closed Chassis Adapter. If you are not familiar with Intel® System Debugger, please review the Intel® System Debugger User Guide.

Loading Debug Symbols

The first step in debugging with Intel® Debug Extensions for WinDbg* is loading the debug symbols. This process takes some time because the debugger has to enumerate all modules and download the PDB files from the Microsoft* symbol server. To see additional details about the download status, run .sym noisy before the .reload /f command. The status word BUSY in the lower-left corner indicates that the command is still executing. Once the symbols are loaded, WinDbg* shows the kd> command prompt. At this point you can run the lm command to see the module list:

0: kd> lm
start             end                 module name
fffff800`b4400000 fffff800`b445c000   msrpc      (pdb symbols)          c:\symbols\msrpc.pdb\D1A2C906531046A3B06666B1678B02DF1\msrpc.pdb
fffff800`b4460000 fffff800`b44c2000   FLTMGR     (pdb symbols)          c:\symbols\fltMgr.pdb\620A988036C34BAFAD3FA05B3C5E27FF1\fltMgr.pdb
fffff800`b44d0000 fffff800`b44f5000   ksecdd     (pdb symbols)          c:\symbols\ksecdd.pdb\A189E34E147D4EF784C97559FD6C40F61\ksecdd.pdb
fffff800`b89c0000 fffff800`b89e2000   WdNisDrv   (pdb symbols)          c:\symbols\WdNisDrv.pdb\0DF82E576FFE483E932D1BE92599179F1\WdNisDrv.pdb
fffff803`d0f56000 fffff803`d0f61000   kdcom      (pdb symbols)          c:\symbols\kd1394.pdb\13106436C8574DFBB424C696AF2BC1632\kd1394.pdb
fffff803`d2007000 fffff803`d27d3000   nt         (pdb symbols)          c:\symbols\ntkrnlmp.pdb\0DE6DC238E194BB78608D54B1E6FA3791\ntkrnlmp.pdb
fffff803`d27d3000 fffff803`d2846000   hal        (pdb symbols)          c:\symbols\hal.pdb\81C1AF690083498BA941D5EC628CDCF41\hal.pdb
fffff961`10a00000 fffff961`10d82000   win32kfull   (pdb symbols)          c:\symbols\win32kfull.pdb\7E008E1CFF454261A8C9F045658183421\win32kfull.pdb
fffff961`10d90000 fffff961`10ef2000   win32kbase   (pdb symbols)          c:\symbols\win32kbase.pdb\4DCD0ED713B74A56A031ECC9E0D3278F1\win32kbase.pdb
fffff961`10f10000 fffff961`10f1a000   TSDDD      (pdb symbols)          c:\symbols\tsddd.pdb\1D46FEC592A447A08B2DEBBA6ED270191\tsddd.pdb
fffff961`10f20000 fffff961`10f5c000   cdd        (pdb symbols)          c:\symbols\cdd.pdb\14C9A1BFDBE84658B69AFE70FE9BF0B11\cdd.pdb
fffff961`113c0000 fffff961`113e3000   win32k     (pdb symbols)          c:\symbols\win32k.pdb\770C6601DF3E461C95D324075C65528F1\win32k.pdb

Unloaded modules:
fffff800`b5770000 fffff800`b577f000   dump_storport.sys
fffff800`b57b0000 fffff800`b57d5000   dump_storahci.sys
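When the module list is long, it can help to save the lm output to a text file and post-process it on the host. The following is a minimal Python sketch, not part of Intel® Debug Extensions for WinDbg*, that parses rows in the format shown above. The function names parse_lm and renamed_symbols are hypothetical, and the sample rows are copied from the listing above. It reports modules whose symbol file name differs from the module name (for instance, kdcom backed by kd1394.pdb in the listing above), which may hint at which kernel debug transport is loaded.

```python
import re
from pathlib import PureWindowsPath

# One row of `lm` output: start address, end address, module name and,
# optionally, "(symbol status)" followed by the symbol file path.
LM_ROW = re.compile(
    r"^(?P<start>[0-9a-f]{8}`[0-9a-f]{8})\s+"
    r"(?P<end>[0-9a-f]{8}`[0-9a-f]{8})\s+"
    r"(?P<name>\S+)"
    r"(?:\s+\((?P<status>[^)]*)\)\s+(?P<path>\S.*))?$",
    re.IGNORECASE,
)


def parse_lm(text):
    """Return one dict per module row found in saved `lm` output."""
    modules = []
    for line in text.splitlines():
        row = LM_ROW.match(line.strip())
        if row is None:
            continue  # headers, blank lines, "Unloaded modules:" and so on
        start = int(row["start"].replace("`", ""), 16)
        end = int(row["end"].replace("`", ""), 16)
        path = row["path"]
        modules.append({
            "name": row["name"],
            "size": end - start,
            "status": row["status"],
            # File name of the PDB without extension, or None if no symbols.
            "pdb": PureWindowsPath(path.strip()).stem if path else None,
        })
    return modules


def renamed_symbols(modules):
    """List (module, pdb) pairs where the PDB name differs from the module name."""
    return [
        (m["name"], m["pdb"])
        for m in modules
        if m["pdb"] is not None and m["pdb"].lower() != m["name"].lower()
    ]


# Sample rows copied from the listing above.
SAMPLE = r"""
fffff800`b4460000 fffff800`b44c2000   FLTMGR     (pdb symbols)          c:\symbols\fltMgr.pdb\620A988036C34BAFAD3FA05B3C5E27FF1\fltMgr.pdb
fffff803`d0f56000 fffff803`d0f61000   kdcom      (pdb symbols)          c:\symbols\kd1394.pdb\13106436C8574DFBB424C696AF2BC1632\kd1394.pdb
fffff803`d2007000 fffff803`d27d3000   nt         (pdb symbols)          c:\symbols\ntkrnlmp.pdb\0DE6DC238E194BB78608D54B1E6FA3791\ntkrnlmp.pdb

Unloaded modules:
fffff800`b5770000 fffff800`b577f000   dump_storport.sys
"""

for name, pdb in renamed_symbols(parse_lm(SAMPLE)):
    print(f"{name} uses symbols from {pdb}.pdb")
```

The name comparison is only a heuristic: nt is always backed by a kernel-specific PDB such as ntkrnlmp.pdb, so a mismatch by itself does not indicate a problem.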

Issue Analysis

Once symbols are loaded, the stack trace becomes more informative, and you can analyze the current state of each processor core, using ~<number> to switch cores.

There are several possible causes of black screens and hardware-related BSODs:

  • Dead loops and deadlocks
  • Kernel Debug transport configuration issues
  • Memory corruption issues, invalid opcodes in key Windows processes
  • Bug Check 0x124: WHEA, NMI interrupt, Machine Check
     

A common method for root-causing Windows* issues is the !analyze -v extension command. This extension performs a large amount of automated analysis and displays the results in the Debugger Command window.

If the !analyze command fails with the “The debuggee is ready to run” message, you can force the analysis to take place as if a crash had occurred by running !analyze -v -f.

Dead Loops and Deadlocks in Windows*

Let’s start with hangs caused by pure software issues. Fortunately, Windows* comes with a built-in Driver Verifier tool that can profile spinlocks. Once deadlock profiling is enabled, the tool records verbose lock state information in the crashdump. When the debug connection is established, the !deadlock extension can be used in conjunction with Driver Verifier to detect inconsistent use of locks in your code that has the potential to cause deadlocks.

Driver Verifier doesn't support APC-level locks: mutexes (fast, guarded) and resources. These locks can be analyzed using the !analyze -hang and !locks commands. If needed, the !thread extension command can be used to obtain the thread information.

For example, here is typical output of the !locks command:

0: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks................................................................................
Resource @ 0xffff8186e25a6d90    Exclusively owned
    Contention Count = 123641
    NumberOfExclusiveWaiters = 5
     Threads: ffff8186ee782080-01<*>
     Threads Waiting On Exclusive Access:
ffff8186edf95080   ffff8186ef3e9080   ffff8186ce4887c0   ffff8186ea1cb080    ffff8186ee7f97c0     
KD: Scanning for held locks....................................
Resource @ 0xffff8186e9bd3d40    Shared 1 owning threads
     Threads: ffff8186ea99a7c0-01<*>
KD: Scanning for held locks..........
12517 total locks, 2 locks currently held
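On a system with many held resources, ranking them by contention makes it easier to pick the owner thread to inspect with !thread. The sketch below is an illustrative Python helper that works on saved !locks output in the format shown above. The function name parse_locks and the dictionary keys are assumptions made for this example, and the sample text is copied from the listing above.

```python
import re

RESOURCE = re.compile(r"^Resource @ (0x[0-9a-f]+)\s+(.*)$", re.IGNORECASE)
OWNER = re.compile(r"([0-9a-f]{16})-\d+<\*>", re.IGNORECASE)
THREAD = re.compile(r"\b[0-9a-f]{16}\b", re.IGNORECASE)


def parse_locks(text):
    """Return one dict per 'Resource @' entry in saved `!locks` output."""
    resources = []
    current = None
    in_waiters = False  # True while reading the list of waiting threads
    for raw in text.splitlines():
        line = raw.strip()
        header = RESOURCE.match(line)
        if header:
            current = {
                "address": header.group(1),
                "state": header.group(2).strip(),
                "contention": 0,
                "owners": [],
                "waiters": [],
            }
            resources.append(current)
            in_waiters = False
        elif current is None:
            continue  # banner lines before the first resource
        elif line.startswith("KD:"):
            in_waiters = False  # progress line ends the current entry
        elif line.startswith("Contention Count"):
            current["contention"] = int(line.split("=")[1])
        elif line.startswith("Threads Waiting On"):
            in_waiters = True
        elif line.startswith("Threads:"):
            current["owners"].extend(OWNER.findall(line))
        elif in_waiters:
            current["waiters"].extend(THREAD.findall(line))
    return resources


# Sample copied from the listing above.
SAMPLE = """\
Resource @ 0xffff8186e25a6d90    Exclusively owned
    Contention Count = 123641
    NumberOfExclusiveWaiters = 5
     Threads: ffff8186ee782080-01<*>
     Threads Waiting On Exclusive Access:
ffff8186edf95080   ffff8186ef3e9080   ffff8186ce4887c0   ffff8186ea1cb080    ffff8186ee7f97c0
KD: Scanning for held locks....................................
Resource @ 0xffff8186e9bd3d40    Shared 1 owning threads
     Threads: ffff8186ea99a7c0-01<*>
KD: Scanning for held locks..........
12517 total locks, 2 locks currently held
"""

worst = max(parse_locks(SAMPLE), key=lambda r: r["contention"])
for owner in worst["owners"]:
    # Command to paste into the debugger for the owning thread.
    print(f"!thread {owner}")
```

For the sample above, the most contended resource is the exclusively owned one with five waiters, so its owner thread is the natural starting point.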

Dead loops can be identified by examining the instruction pointer and the stack, by setting breakpoints, or by stepping through the code with the Step Over command.

Kernel Debug Transport Configuration Issues

A software trap combined with a misconfiguration of the debug transport methods might cause Windows* to wait for the kernel debugger to connect instead of generating a BSOD and a crashdump, giving the appearance of an unresponsive black screen hang.

Here are some examples of such configuration issues:

  • Network Kernel Debugging is configured, but a supported NIC is not installed in the system
  • Kernel-Mode Debugging over a 1394 (FireWire) cable is configured, but a FireWire controller is not installed in the system
  • Kernel-Mode USB Debugging is configured, but it conflicts with Intel® DCI

This might happen while you are debugging a hard-to-reproduce issue, and in that case it is important to collect the debug data.

When Windows* is waiting for the kernel debugger connection, at least one thread will have a TrapFrame. The stack looks like this:

fffff880`0c440f50 fffff800`035d0a96 nt!ObpLookupObjectName+0x588
fffff880`0c441040 fffff800`035aef66 nt!ObOpenObjectByName+0x306
fffff880`0c441110 fffff800`032d2e53 nt!NtQueryAttributesFile+0x145
fffff880`0c4413a0 00000000`7754168a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`0c4413a0)
00000000`0008c9f8 00000000`73a2ae19 ntdll!NtQueryAttributesFile+0xa
00000000`0008ca00 00000000`73a1d18f wow64!whNtQueryAttributesFile+0x91
00000000`0008ca80 00000000`750c2776 wow64!Wow64SystemServiceEx+0xd7
00000000`0008d340 00000000`73a1d286 wow64cpu!ServiceNoTurbo+0x2d
00000000`0008d400 00000000`73a18a90 wow64!RunCpuSimulation+0xa
00000000`0008d450 00000000`739e2c52 wow64!Wow64KiUserCallbackDispatcher+0x204
00000000`0008d7a0 00000000`775411f5 wow64win!whcbfnDWORD+0xe2
00000000`0008e190 00000000`739efe4a ntdll!KiUserCallbackDispatcherContinue (TrapFrame @ 00000000`0008e058)
00000000`0008e218 00000000`739caf02 wow64win!ZwUserMessageCall+0xa

In this case you can restore the register context from the trap information using the .trap [Address] command. For example:

0: kd> .trap 0xfffffffff3a4ea60

ErrCode = 00000000
eax=0000e470 ebx=e1fdb600 ecx=e1753c2c edx=00000011 esi=00740065 edi=e1753be8
eip=8092e20b esp=f3a4ead4 ebp=f3a4eaec iopl=0         nv up ei pl nz na pe nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010206
nt!ObpLookupDirectoryEntry+0xee:
8092e20b 394608          cmp     dword ptr [esi+8],eax ds:0023:0074006d=????????

The next steps of the analysis depend on the type of issue. Search MSDN for the exception type, generate a minidump after restoring the trap context, and run the !analyze -f -v command to analyze the crash further.
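If stack listings for several threads have been saved to a file, the TrapFrame addresses can be collected automatically. The following Python sketch is an illustration only: the helper name trap_commands is hypothetical, and it assumes the "(TrapFrame @ address)" annotation format shown in the stack listing earlier in this section. It turns each annotation into a .trap command that can be pasted into the debugger.

```python
import re

# Matches "module!symbol+offset (TrapFrame @ address)" at the end of a frame.
TRAP = re.compile(
    r"(?P<symbol>\S+![^\s(]+)\s*\(TrapFrame @ (?P<addr>[0-9a-f`]+)\)",
    re.IGNORECASE,
)


def trap_commands(stack_text):
    """Return (symbol, '.trap 0x...') pairs for frames that carry a TrapFrame."""
    commands = []
    for line in stack_text.splitlines():
        found = TRAP.search(line)
        if found:
            address = found.group("addr").replace("`", "")
            commands.append((found.group("symbol"), f".trap 0x{address}"))
    return commands


# Frames copied from the stack listing in this section.
SAMPLE = """\
fffff880`0c441110 fffff800`032d2e53 nt!NtQueryAttributesFile+0x145
fffff880`0c4413a0 00000000`7754168a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`0c4413a0)
00000000`0008c9f8 00000000`73a2ae19 ntdll!NtQueryAttributesFile+0xa
00000000`0008e190 00000000`739efe4a ntdll!KiUserCallbackDispatcherContinue (TrapFrame @ 00000000`0008e058)
"""

for symbol, command in trap_commands(SAMPLE):
    print(f"{command}    (frame {symbol})")
```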

Memory Corruption Issues and Invalid Opcodes in Key Windows Processes

The most complex issues to debug are memory corruption issues. In this case the system might crash with seemingly random errors, and because of corrupted PTEs (page table entries) crash dumps might not contain useful information. The recommendation is to run Driver Verifier on all non-Microsoft drivers. If it doesn't find any violations, run !chkimg. If the memory corruption happens in a non-writable area protected by the NX bit, it might be caused by a BIOS issue, a memory controller issue, or malware.

0: kd> !chkimg -lo 50 -d !nt
    fffff803ae41f594-fffff803ae41f595  2 bytes - nt!MiDuplicateCloneLeaf+38
       [ 80 fa:00 ec ]
    fffff803ae42024f-fffff803ae420250  2 bytes - nt!MiExpandPagedPool+83 (+0xcbb)
       [ 80 f6:00 ea ]
    fffff803ae420461-fffff803ae420462  2 bytes - nt!MiExpandSystemCache+85 (+0x212)
       [ 80 f6:00 ea ]
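To compare mismatches across several modules or several runs, the !chkimg -d summary can be parsed into records. The sketch below is a hedged Python example, not an official tool. The function name parse_chkimg is hypothetical, and it assumes that the bytes to the left of the colon are the expected image bytes and the bytes to the right are what was read from target memory.

```python
import re

# "start-end  N bytes - module!symbol+offset"
RANGE = re.compile(
    r"^\s*([0-9a-f]+)-([0-9a-f]+)\s+(\d+) bytes? - (\S+)", re.IGNORECASE
)
# "[ expected bytes:actual bytes ]" (interpretation is an assumption, see above)
BYTES = re.compile(
    r"^\s*\[\s*([0-9a-f ]+?)\s*:\s*([0-9a-f ]+?)\s*\]", re.IGNORECASE
)


def parse_chkimg(text):
    """Return one dict per mismatched range in saved `!chkimg -d` output."""
    mismatches = []
    for line in text.splitlines():
        found = RANGE.match(line)
        if found:
            mismatches.append({
                "start": int(found.group(1), 16),
                "end": int(found.group(2), 16),
                "count": int(found.group(3)),
                "symbol": found.group(4),
                "expected": b"",
                "actual": b"",
            })
            continue
        found = BYTES.match(line)
        if found and mismatches:
            # bytes.fromhex accepts space-separated hex pairs.
            mismatches[-1]["expected"] = bytes.fromhex(found.group(1))
            mismatches[-1]["actual"] = bytes.fromhex(found.group(2))
    return mismatches


# Sample copied from the listing above.
SAMPLE = """\
    fffff803ae41f594-fffff803ae41f595  2 bytes - nt!MiDuplicateCloneLeaf+38
       [ 80 fa:00 ec ]
    fffff803ae42024f-fffff803ae420250  2 bytes - nt!MiExpandPagedPool+83 (+0xcbb)
       [ 80 f6:00 ea ]
    fffff803ae420461-fffff803ae420462  2 bytes - nt!MiExpandSystemCache+85 (+0x212)
       [ 80 f6:00 ea ]
"""

for item in parse_chkimg(SAMPLE):
    print(f"{item['symbol']}: {item['expected'].hex(' ')} -> {item['actual'].hex(' ')}")
```

Grouping the records by the actual byte pattern can show whether the same value was written in many places, which is a different signature from scattered random corruption.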

Bug Check 0x124: WHEA, NMI Interrupt, Machine Check

The most interesting issues to analyze are hardware issues that lead to unrecoverable errors. In this case it is not guaranteed that the system will fail with the 0x124 error. It also might not be able to write the crashdump to the disk successfully. The system might freeze after the second NMI interrupt, but before the BSOD screen is shown. In such a case, first run !analyze -v to confirm that the issue is an uncorrectable hardware error. Next, run the !whea and !errrec extensions to obtain the crash details. Here is an example:

1: kd> !whea
Error Source Table @ fffff801a632f4a0
5 Error Sources
 
Error Source 0 @ ffff8200b72b0bd0
   Notify Type      : {b7f99bd0-8200-ffff-a8f4-32a601f8ffff}
   Type             : 0x4 (PCIe)
   Error Count      : 1
   Record Count     : 1
   Record Length    : 750
   Error Records    : wrapper @ ffff8200b733a010  record @ ffff8200b733a038
   Descriptor       : @ ffff8200b72b0c29
      Length                     : 3cc
      Max Raw Data Length        : d0
      Num Records To Preallocate : 1
      Max Sections Per Record    : 3
      Error Source ID            : 0
      Flags                      : 00000000
Error Source 1 @ ffff8200b7f99bd0
   Notify Type      : {b7f9cbd0-8200-ffff-d00b-2bb70082ffff}
   Type             : 0x0 (MCE)
   Error Count      : 0
   Record Count     : 4
   Record Length    : 728
   Error Records    : wrapper @ ffff8200b91f3000  record @ ffff8200b91f3028
                    : wrapper @ ffff8200b91f3728  record @ ffff8200b91f3750
                    : wrapper @ ffff8200b91f3e50  record @ ffff8200b91f3e78
                    : wrapper @ ffff8200b91f4578  record @ ffff8200b91f45a0
........
Error Source 4 @ ffff8200b1a7fb60
   Notify Type      : {b7f65bd0-8200-ffff-d0cb-f9b70082ffff}
   Type             : 0x3 (NMI)
   Error Count      : 0
   Record Count     : 1
   Record Length    : 6c0
   Error Records    : wrapper @ ffff8200b7f91940  record @ ffff8200b7f91968
   Descriptor       : @ ffff8200b1a7fbb9
      Length                     : 3cc
      Max Raw Data Length        : 100
      Num Records To Preallocate : 1
      Max Sections Per Record    : 3
      Error Source ID            : 3
      Flags                      : 00000000
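With several error sources and multiple preallocated records, it is easy to pick the wrong record address. The following Python sketch is illustrative only: the helper names parse_whea and errrec_commands are hypothetical, and it assumes the Error Count field is printed in decimal. It scans saved !whea output and builds an !errrec command for each record of every source that reports at least one error.

```python
import re

SOURCE = re.compile(r"^Error Source (\d+) @ ([0-9a-f]+)", re.IGNORECASE)
TYPE = re.compile(r"^Type\s*:\s*(0x[0-9a-f]+) \((\w+)\)", re.IGNORECASE)
COUNT = re.compile(r"^Error Count\s*:\s*(\d+)", re.IGNORECASE)
RECORD = re.compile(r"record @ ([0-9a-f]+)", re.IGNORECASE)


def parse_whea(text):
    """Return one dict per 'Error Source N @' entry in saved `!whea` output."""
    sources = []
    for raw in text.splitlines():
        line = raw.strip()
        header = SOURCE.match(line)
        if header:
            sources.append({
                "index": int(header.group(1)),
                "type": None,
                "errors": 0,
                "records": [],
            })
            continue
        if not sources:
            continue  # table header lines before the first source
        current = sources[-1]
        kind = TYPE.match(line)
        count = COUNT.match(line)
        if kind:
            current["type"] = kind.group(2)
        elif count:
            current["errors"] = int(count.group(1))  # assumed decimal
        # Record addresses may share a line with other fields.
        current["records"].extend(RECORD.findall(line))
    return sources


def errrec_commands(sources):
    """Build `!errrec` commands for sources that reported at least one error."""
    return [
        f"!errrec {record}"
        for source in sources
        if source["errors"] > 0
        for record in source["records"]
    ]


# Shortened sample based on the listing above.
SAMPLE = """\
Error Source Table @ fffff801a632f4a0
5 Error Sources

Error Source 0 @ ffff8200b72b0bd0
   Notify Type      : {b7f99bd0-8200-ffff-a8f4-32a601f8ffff}
   Type             : 0x4 (PCIe)
   Error Count      : 1
   Record Count     : 1
   Record Length    : 750
   Error Records    : wrapper @ ffff8200b733a010  record @ ffff8200b733a038
Error Source 4 @ ffff8200b1a7fb60
   Notify Type      : {b7f65bd0-8200-ffff-d0cb-f9b70082ffff}
   Type             : 0x3 (NMI)
   Error Count      : 0
   Record Count     : 1
   Record Length    : 6c0
   Error Records    : wrapper @ ffff8200b7f91940  record @ ffff8200b7f91968
      Error Source ID            : 3
"""

for command in errrec_commands(parse_whea(SAMPLE)):
    print(command)
```

For the sample, only the PCIe source has a non-zero error count, so the single suggested command targets record ffff8200b733a038.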

The record address reported for the failing error source can then be passed to !errrec:

1: kd> !errrec ffff8200b733a038
===============================================================================
Common Platform Error Record @ ffff8200b733a038
-------------------------------------------------------------------------------
Record Id     : 01d22bad8598f73b
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 10/21/2016 15:13:03 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ ffff8200b733a0b8
Section       @ ffff8200b733a148
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Fatal

Port Type     : Root Port
Version       : 1.1
Command/Status: 0x4010/0x0504
Device Id     :
  VenId:DevId : 8086:a296
  Class code  : 030400
  Function No : 0x00
  Device No   : 0x1c
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ ffff8200b733a17c
  Device Caps : 00008001 Role-Based Error Reporting: 1
  Device Ctl  : 0007 ur FE NF CE
  Dev Status  : 0014 ur FE nf ce
   Root Ctl   : 0008 fs nfs cs

AER Information @ ffff8200b733a1b8
  Uncorrectable Error Status    : 00000010 ur ecrc mtlp rof uc ca cto fcp ptlp sd DLP und
  Uncorrectable Error Mask      : 00010000 ur ecrc mtlp rof UC ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00060011 ur ecrc MTLP ROF uc ca cto fcp ptlp sd DLP UND
  Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
  Correctable Error Mask        : 00002000 ADV rtto rnro dllp tlp re
  Caps & Control                : 00000004 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
  Header Log                    : 00000000 00000000 00000000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00

The output of these two commands contains enough information to find the actual problem. In this example, the PCI Express (PCIe) Advanced Error Reporting (AER) structure is provided for device 8086:a296 (South Bridge), and the set bits appear in uppercase in the listing. Based on the PCIe documentation, the Uncorrectable Error Status value 00000010 indicates that the 0x124 BSOD was triggered by the “Data Link Protocol Error” uncorrectable error (UCE). Further analysis could be done by the PCIe team.
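The raw AER register values can also be decoded by hand to cross-check the flag names printed by !errrec. The Python sketch below is a minimal illustration. The bit positions follow the AER Uncorrectable Error Status register layout in the PCI Express Base Specification as I understand it and should be verified against the specification revision for your platform. The function names are hypothetical.

```python
# Bit position -> error name for the AER Uncorrectable Error Status, Mask
# and Severity registers (assumed layout, check against the PCIe spec).
UNCORRECTABLE_ERROR_BITS = {
    0: "Undefined (formerly Training Error)",
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_uncorrectable(value):
    """Return the names of all known bits set in an AER uncorrectable register."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_ERROR_BITS.items())
        if (value >> bit) & 1
    ]


def fatal_unmasked(status, mask, severity):
    """Errors that are reported, not masked, and configured as fatal."""
    return decode_uncorrectable(status & ~mask & severity)


# Values taken from the !errrec output above.
STATUS = 0x00000010
MASK = 0x00010000
SEVERITY = 0x00060011

print("Status  :", decode_uncorrectable(STATUS))
print("Mask    :", decode_uncorrectable(MASK))
print("Severity:", decode_uncorrectable(SEVERITY))
print("Fatal   :", fatal_unmasked(STATUS, MASK, SEVERITY))
```

With the values from this example, the only status bit set is bit 4, which is unmasked and configured as fatal. That agrees with the uppercase DLP flag and the Fatal severity in the error record.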

Conclusion

While debugging Windows* black screen hangs and crashes is a difficult task, Intel® Debug Extensions for WinDbg*, included with Intel® System Debugger, simplify the debug process by providing an Intel® DCI connection method to an otherwise unresponsive Windows* target.