The BSOD occurs when I open a VRAM-intensive application while training is running on the GPU. I've regularly opened such applications during training in the past without problems, even with bigger models; they would fall back to shared memory if they didn't have enough VRAM (which training cannot do). I have no clue what's different now.
Training runs in __main__ under try-except, and the dump file says (as shown here):
Typically, the referenced address is in freed memory or is simply invalid. This cannot be protected by a try - except handler -- it can only be protected by a probe or similar programming techniques.
If I replace it with try-finally, the message is nearly the same, except that it says try-finally instead, even though there are numerous try-excepts in the non-__main__ code. I've yet to try running without a try block in __main__.
Training runs via PyTorch Lightning with 16-bit AMP and no multiprocessing. The crash is not easily reproduced, and the dump file doesn't seem to point to Python code. How can this be debugged?
Env info: Windows 10 x64, RTX 2080 Super, AMD Ryzen 5 3600, BIOS 4602, latest drivers
Python 3.8.12, pytorch 1.10.0, pytorch-lightning 1.5.7
Dump summary:
************* Path validation summary **************
Response Time (ms) Location
Deferred srv*
Symbol search path is: srv*
Executable search path is:
Windows 10 Kernel Version 19041 MP (12 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS Personal
Edition build lab: 19041.1.amd64fre.vb_release.191206-1406
Machine Name:
Kernel base = 0xfffff803`80e00000 PsLoadedModuleList = 0xfffff803`81a2a2b0
Debug session time: Fri Jan 21 19:46:53.513 2022 (UTC + 4:00)
System Uptime: 0 days 0:57:47.148
Loading Kernel Symbols
...............................................................
................................................................
................................................................
..........................
Loading User Symbols
PEB is paged out (Peb.Ldr = 00000099`5954d018). Type ".hh dbgerr001" for details
Loading unloaded module list
............
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff803`811f72e0 48894c2408 mov qword ptr [rsp+8],rcx ss:0018:ffffd48c`001d3c10=0000000000000050
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffff9b84f54fe000, memory referenced.
Arg2: 0000000000000002, value 0 = read operation, 1 = write operation.
Arg3: fffff803966bd413, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 0000000000000000, (reserved)
Debugging Details:
------------------
KEY_VALUES_STRING: 1
Key : AV.Type
Value: Write
Key : Analysis.CPU.mSec
Value: 4937
Key : Analysis.DebugAnalysisManager
Value: Create
Key : Analysis.Elapsed.mSec
Value: 7978
Key : Analysis.Init.CPU.mSec
Value: 734
Key : Analysis.Init.Elapsed.mSec
Value: 7797
Key : Analysis.Memory.CommitPeak.Mb
Value: 85
Key : WER.OS.Branch
Value: vb_release
Key : WER.OS.Timestamp
Value: 2019-12-06T14:06:00Z
Key : WER.OS.Version
Value: 10.0.19041.1
FILE_IN_CAB: MEMORY.DMP
BUGCHECK_CODE: 50
BUGCHECK_P1: ffff9b84f54fe000
BUGCHECK_P2: 2
BUGCHECK_P3: fffff803966bd413
BUGCHECK_P4: 0
READ_ADDRESS: ffff9b84f54fe000 Paged pool
MM_INTERNAL_CODE: 0
IMAGE_NAME: nvlddmkm.sys
MODULE_NAME: nvlddmkm
FAULTING_MODULE: fffff80395f30000 nvlddmkm
BLACKBOXBSD: 1 (!blackboxbsd)
BLACKBOXNTFS: 1 (!blackboxntfs)
BLACKBOXPNP: 1 (!blackboxpnp)
BLACKBOXWINLOGON: 1
PROCESS_NAME: python.exe
TRAP_FRAME: ffffd48c001d3eb0 -- (.trap 0xffffd48c001d3eb0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=ffff9b84f54fd278 rbx=0000000000000000 rcx=ffff9b84f54fe010
rdx=00005c7ea1422a48 rsi=0000000000000000 rdi=0000000000000000
rip=fffff803966bd413 rsp=ffffd48c001d4048 rbp=ffff9b84f54e2000
r8=0000000000000000 r9=0000000000000012 r10=0000000000000000
r11=fffff80396920ed8 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl nz na po nc
nvlddmkm+0x78d413:
fffff803`966bd413 0f2941f0 movaps xmmword ptr [rcx-10h],xmm0 ds:ffff9b84`f54fe000=????????????????????????????????
Resetting default scope
STACK_TEXT:
ffffd48c`001d3c08 fffff803`8124a81f : 00000000`00000050 ffff9b84`f54fe000 00000000`00000002 ffffd48c`001d3eb0 : nt!KeBugCheckEx
ffffd48c`001d3c10 fffff803`8109f4d0 : 00000001`00000000 00000000`00000002 ffffd48c`001d3f30 00000000`00000000 : nt!MiSystemFault+0x18d32f
ffffd48c`001d3d10 fffff803`8120545e : 00000000`00000000 00000000`00000001 fffff803`81a50bc0 ffff9b84`f54fd000 : nt!MmAccessFault+0x400
ffffd48c`001d3eb0 fffff803`966bd413 : fffff803`96c589ad fffff803`968dee68 00000000`00000000 00000000`00000000 : nt!KiPageFault+0x35e
ffffd48c`001d4048 fffff803`96c589ad : fffff803`968dee68 00000000`00000000 00000000`00000000 00000000`00000000 : nvlddmkm+0x78d413
ffffd48c`001d4050 fffff803`96c70d46 : 00000000`00000000 ffffd48c`001d4100 ffff8b06`09f02000 ffff9b84`f54e2000 : nvlddmkm!nvDumpConfig+0x3e750d
ffffd48c`001d4090 fffff803`96d03f47 : ffffd48c`001d4260 fffff803`81a4f0c0 00000000`00000001 ffffd48c`001d46b8 : nvlddmkm!nvDumpConfig+0x3ff8a6
ffffd48c`001d4140 fffff803`96d05607 : 00000000`c000000d ffffd48c`001d4280 00000000`00000001 ffff8b06`09f02000 : nvlddmkm!nvDumpConfig+0x492aa7
ffffd48c`001d4180 fffff803`96c8b9a3 : 00000000`c000000d ffffd48c`001d4469 ffffd48c`001d46b8 00000000`c000000d : nvlddmkm!nvDumpConfig+0x494167
ffffd48c`001d43e0 fffff803`90e1afba : 00000000`00000000 ffff8b06`744ad300 00000000`00000000 00000000`4e562a2a : nvlddmkm!nvDumpConfig+0x41a503
ffffd48c`001d44d0 fffff803`90cf9c39 : ffff8b06`0f21b868 ffffd48c`00000000 ffff8b06`0f21b868 ffffffff`ffffffff : dxgkrnl!TdrIsEnabled+0x821ca
ffffd48c`001d4580 fffff803`81208cb8 : 00000206`40efb900 ffff8b06`1d426080 00000000`00000000 ffff8b06`00000000 : dxgkrnl!NtGdiDdDDIEscape+0x1879
ffffd48c`001d4b00 00007fff`b2bf4be4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x28
00000099`5b30def8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007fff`b2bf4be4
SYMBOL_NAME: nvlddmkm+78d413
STACK_COMMAND: .cxr; .ecxr ; kb
BUCKET_ID_FUNC_OFFSET: 78d413
FAILURE_BUCKET_ID: AV_W_(null)_nvlddmkm!unknown_function
OS_VERSION: 10.0.19041.1
BUILDLAB_STR: vb_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {838100fa-f28b-2ef7-d702-e31713cb338c}
Followup: MachineOwner
---------
Full dump (~5GB uncompressed)
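As a sanity check on the dump summary, the addresses it prints can be cross-checked with a little arithmetic. The sketch below uses only values copied from the dump; the function names are made up for illustration. It confirms that the faulting instruction lies at nvlddmkm+0x78d413 and that the movaps store targets exactly the address in Arg1, which suggests the fault is raised inside the kernel-mode driver rather than anywhere Python exception handling could reach.

```python
# Values copied verbatim from the dump summary above.
ARG1_REFERENCED = 0xFFFF9B84F54FE000   # Arg1 / BUGCHECK_P1: memory referenced
ARG3_INSTRUCTION = 0xFFFFF803966BD413  # Arg3 / BUGCHECK_P3: faulting instruction
MODULE_BASE = 0xFFFFF80395F30000       # FAULTING_MODULE: base of nvlddmkm
RCX = 0xFFFF9B84F54FE010               # rcx from the trap frame

PAGE_SIZE = 0x1000  # 4 KiB pages on x64


def module_offset(instruction: int, base: int) -> int:
    """Offset of the faulting instruction inside the module image."""
    return instruction - base


def store_target(rcx: int) -> int:
    """Address written by `movaps xmmword ptr [rcx-10h], xmm0`."""
    return rcx - 0x10


offset = module_offset(ARG3_INSTRUCTION, MODULE_BASE)
target = store_target(RCX)

print(f"faulting instruction: nvlddmkm+{offset:#x}")
print(f"store target:         {target:#x}")
print(f"matches Arg1:         {target == ARG1_REFERENCED}")
print(f"page-aligned:         {target % PAGE_SIZE == 0}")
```

The store lands on the first byte of a page. That would be consistent with a copy running past the end of a mapped buffer, but this is only a guess without symbols for nvlddmkm.sys.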
I had the same issue: PAGE_FAULT_IN_NONPAGED_AREA, nvlddmkm.sys, python.exe running PyTorch. I was able to reproduce it consistently.
The solution was to use DDU (Display Driver Uninstaller) to nuke the NVIDIA drivers, then let Windows Update install its version of the NVIDIA drivers (which tends to be more stable).
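To confirm which driver ended up installed after the reinstall, something like the following can help. It is a minimal sketch that assumes nvidia-smi is on PATH (it normally ships with the NVIDIA driver); the helper names parse_smi_line and query_driver are hypothetical, not part of any library.

```python
import shutil
import subprocess


def parse_smi_line(line: str) -> dict:
    """Split one 'driver_version, name' CSV row from nvidia-smi."""
    driver, name = [part.strip() for part in line.split(",", 1)]
    return {"driver_version": driver, "name": name}


def query_driver() -> list:
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--query-gpu=driver_version,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [parse_smi_line(row) for row in result.stdout.splitlines() if row.strip()]


if __name__ == "__main__":
    for gpu in query_driver():
        print(gpu)
```

Running this before and after the DDU step makes it easy to record which driver version the crash occurred on and which one fixed it.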