Update
If you have not read it, please see ‘Original Post’ below.
To track down this memory corruption, i.e. why pointers to bad (unmapped) memory regions were sitting in the free pool and eventually getting handed out, I put together some rather brute-force instrumentation in MallocBinned2. Every time it hands out a pointer, it records the pointer in an array, and every time it frees a pointer, it removes the matching array element (I’m just using a flat array that can hold millions of pointers).
With this I hoped to catch double frees and frees of pointers that were never handed out by MallocBinned2.
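For anyone who wants to try the same kind of check, here is a minimal sketch of the idea. The names (FLivePointerTracker, OnMalloc, OnFree, EFreeCheck) are made up for illustration and are not engine API, and it uses std::unordered_set instead of the flat array described above. A real hook inside the allocator would need storage that does not itself allocate through the allocator being instrumented, which is one reason to prefer a preallocated flat array.

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Hypothetical debug-only tracker; not part of FMallocBinned2.
enum class EFreeCheck { Ok, NotLive };

class FLivePointerTracker
{
public:
    // Call right before the allocator returns Ptr to the caller.
    void OnMalloc(void* Ptr)
    {
        if (Ptr == nullptr)
        {
            return;
        }
        std::lock_guard<std::mutex> Lock(Mutex);
        Live.insert(Ptr);
    }

    // Call at the top of Free. NotLive means the pointer is either foreign
    // (never handed out by this allocator) or was already freed; this check
    // alone cannot tell the two cases apart.
    EFreeCheck OnFree(void* Ptr)
    {
        if (Ptr == nullptr)
        {
            return EFreeCheck::Ok; // freeing null is legal
        }
        std::lock_guard<std::mutex> Lock(Mutex);
        return Live.erase(Ptr) == 1 ? EFreeCheck::Ok : EFreeCheck::NotLive;
    }

    // Number of pointers currently handed out and not yet freed.
    std::size_t NumLive()
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        return Live.size();
    }

private:
    std::mutex Mutex;
    std::unordered_set<void*> Live;
};
```

On a NotLive result you would log the callstack or break into the debugger before the pointer reaches the free list.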
It may be a red herring, but the most likely suspect seems to be a callstack coming from AsyncLineTraces.
There is a copy that invokes a TArray copy constructor, and that copy pushes a pointer into the free blocks that failed my test, meaning it either a) didn’t come from MallocBinned2 or b) was double freed.
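To show why either case matters, here is a toy sketch of a pool with an intrusive free list. It is not MallocBinned2’s actual bookkeeping (FToyPool and both helper functions are invented for illustration), but it demonstrates the general failure mode: after a double free the same block is handed out twice, and whatever one owner writes into it is later read back as the list’s next-free pointer.

```cpp
#include <cstdint>
#include <cstring>
#include <new>

// Toy fixed-size pool with an intrusive free list. Illustrative only.
struct FToyPool
{
    struct FNode { FNode* Next; };

    FNode* FirstFree = nullptr;

    // Push a block onto the free list with no validation at all.
    void Free(void* Ptr)
    {
        FirstFree = new (Ptr) FNode{FirstFree};
    }

    // Pop the head of the free list, or return nullptr if it is empty.
    void* Malloc()
    {
        FNode* Node = FirstFree;
        if (Node != nullptr)
        {
            FirstFree = Node->Next;
        }
        return Node;
    }
};

// Returns true if a double free makes the pool hand out one block twice.
bool DoubleFreeHandsOutSameBlockTwice()
{
    alignas(16) unsigned char Storage[2][32];
    FToyPool Pool;
    Pool.Free(Storage[0]);
    Pool.Free(Storage[1]);
    void* A = Pool.Malloc();
    Pool.Free(A);
    Pool.Free(A); // double free: A's Next now points at A itself
    void* First = Pool.Malloc();
    void* Second = Pool.Malloc();
    return First == A && Second == A;
}

// After a double free, data written by one owner of the block becomes the
// free list head. Returns that head without dereferencing it.
void* FreeListHeadAfterUserWrite(std::uintptr_t Payload)
{
    alignas(16) unsigned char Storage[32];
    FToyPool Pool;
    Pool.Free(Storage);
    Pool.Free(Storage);                        // double free
    void* Owner = Pool.Malloc();               // head still points at Storage
    std::memcpy(Owner, &Payload, sizeof(Payload)); // owner writes its data
    Pool.Malloc();                             // pops Storage again, head = Payload
    return Pool.FirstFree;
}
```

The second helper stops before the next Malloc, which would dereference the payload as if it were a free block. A foreign pointer pushed into the list has a similar effect: the pool ends up with a block that it does not own.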
Stack trace looks like:
It seems to be the ‘OutHits’ array in an FTraceDatum being resized from 4 to 0, where the FTraceDatum lives in the AsyncTraceData’s ‘TraceData’ buffer:
TArray<TUniquePtr<TTraceThreadData<FTraceDatum>>> TraceData;
The original data pointer of OutHits is then passed to MallocBinned2. So far I have not been able to ‘catch’ this memory going from a valid state to an invalid one, so it’s possible I have the wrong suspect, but the behavior seems odd. Also of note, these pointers are being pushed into SmallPoolTables[8], which is the table that we are consistently crashing on. We also make very heavy use of async line traces on the server.
I have to keep digging, but I wanted to get feedback from someone who knows the allocation system better than I do: is this indeed odd behavior that I should investigate, or does it seem benign?
Original Post
We are seeing a crash that used to be somewhat rare but is now more frequent (possibly due to a much higher running server instance count).
Our server build (running on Windows Server 2012, built as a shipping x64 target) crashes and brings up the native crash handler without creating a dump or flushing the log.
We recently caught it and created a minidump with heap through Windows, and saw that the crash handler itself was crashing in FMallocBinned2::MallocExternal, specifically inside:
void* Result = Pool->AllocateRegularBlock();
Judging from the disassembly, it seemed to be crashing when dereferencing FirstFreeBlock in
bool HasFreeRegularBlock() const
{
    CheckCanary(ECanary::FirstFreeBlockIsPtr);
    return FirstFreeBlock && FirstFreeBlock->GetNumFreeRegularBlocks() != 0;
}
Here is the callstack:
I did a bit of digging to see why the FirstFreeBlock access was causing a 0xc0000005 (access violation), and saw from the register state that it was looking up SmallPoolTables[8], which looked valid.
However, the first free block pointer, whose dereference caused the crash, was of course invalid (even though it looks somewhat in the right range). Looking it up in WinDbg gave me:
Content source: 1 (target), length: dd40
0:000> !address 0x000001ec47d0d159
Usage: Free
Base Address: 000001ec`34d90000
End Address: 000001ec`6e170000
Region Size: 00000000`393e0000 ( 915.875 MB)
State: 00010000 MEM_FREE
Protect: 00000001 PAGE_NOACCESS
Type:
But running the same command on the free block pointers of several pools around that one gave me pointers in valid, mapped (64KB) pages. Also of note, the page info that pointed to the bad free block and all surrounding structures seemed valid, with correct canaries and valid pointers (verified with WinDbg), so to me it doesn’t look like a stomp (at least not a trivial one).
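For reference, the !address output above can be sanity-checked with a few lines of arithmetic. This is only a sketch: FRegionInfo and the helper functions are hypothetical names, and the constants are the documented Win32 values for MEM_COMMIT, MEM_FREE, PAGE_NOACCESS and PAGE_GUARD, redefined here so the snippet compiles anywhere. On Windows, VirtualQuery fills a MEMORY_BASIC_INFORMATION with the same kind of fields, so a similar check could in principle run in-process on a debug build.

```cpp
#include <cstdint>

// Documented Win32 values, redefined locally to keep the sketch portable.
constexpr std::uint32_t kMemCommit = 0x1000;   // MEM_COMMIT
constexpr std::uint32_t kMemFree = 0x10000;    // MEM_FREE
constexpr std::uint32_t kPageNoAccess = 0x01;  // PAGE_NOACCESS
constexpr std::uint32_t kPageGuard = 0x100;    // PAGE_GUARD

// Hypothetical mirror of the fields that !address prints for a region.
struct FRegionInfo
{
    std::uint64_t Base;
    std::uint64_t Size;
    std::uint32_t State;
    std::uint32_t Protect;
};

// True if Addr falls inside [Base, Base + Size).
constexpr bool Contains(const FRegionInfo& Region, std::uint64_t Addr)
{
    return Addr >= Region.Base && Addr - Region.Base < Region.Size;
}

// True only for committed memory that is neither no-access nor a guard page.
constexpr bool IsDereferenceable(const FRegionInfo& Region)
{
    return Region.State == kMemCommit
        && (Region.Protect & (kPageNoAccess | kPageGuard)) == 0;
}

// Start of the 64KB page that contains Addr.
constexpr std::uint64_t PageBase64K(std::uint64_t Addr)
{
    return Addr & ~std::uint64_t{0xFFFF};
}

// The region reported for the bad pointer in the dump.
constexpr FRegionInfo BadRegion{0x000001ec34d90000ull, 0x393e0000ull,
                                kMemFree, kPageNoAccess};
constexpr std::uint64_t BadPointer = 0x000001ec47d0d159ull;
```

With these values, the bad pointer lies inside the reported region, Base + Size matches the reported End Address, and the region is not dereferenceable, which is consistent with the access violation.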
Now, since this crash was in the crash reporter, we obviously crashed before this. I was able to pull the crash context out of memory and use [rip] and [rsp] from the context to resurrect the initial crash in WinDbg. It seems to have crashed in the same place, but from an allocation in gameplay code:
I again looked at the register state and it was getting the same exact bad block pointer from that pool state. Which seems consistent, at least.
Has this issue come up before, or do you have any insight into what could be happening? I’ve run the game with the stomp allocator to try to suss out any bad memory access on our part or in any plugins, but as I mentioned, all the other memory around the bad pointer (which itself is a reasonable value) is pristine.
The only other issue on UDN that looked remotely similar was:
https://udn.unrealengine.com/questions/359869/mallocbinned-crash.html
But that one looks like it’s in the original MallocBinned, albeit in almost exactly the corresponding place.
I’ll continue trying to see if we are inducing this in any way, but if you have any insight it would be most helpful.
Thanks!