Inside the .NET Runtime: Analyzing an OOM Exception and Tracing It Through the Source

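Abstract

This article records a source-level investigation of an "out of memory" (OOM) exception in a .NET application. It started with an apparent contradiction: the diagnostic tools showed a sufficiently large free memory block (about 50MB), yet the garbage collector (GC) failed while trying to reserve a smaller memory segment (about 16MB).

To find the root cause, I traced the data path behind the output of the diagnostic tools (such as SOS) backwards: starting from the high-level diagnostic command (AnalyzeOOMCommand), then down through the Microsoft.Diagnostics.Runtime (CLRMD) library, the managed helper classes and the Dac interfaces, and finally into the GC-related native code of the .NET Runtime (CoreCLR), such as ClrDataAccess and gc_heap.

In the end I did not pin down the real cause of the problem. I did, however, work out the exact logic by which a reservation can fail inside the GC's virtual_alloc, including the memory limit check and the address space layout considerations, and I gained some familiarity with the .NET source along the way, which makes it well worth writing down and sharing.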

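Background

A few days ago, someone in a .NET debugging chat group hit an OOM (Out Of Memory) exception in a .NET program. He posted some screenshots and asked what could have caused it. One of the analysis results looked like this:

[Image: oom]

It looks like the process ran out of memory during a GC. Most likely an attempt to allocate about 16MB failed, yet another screenshot shows that the largest free block was still 50MB:

[Image: address-summary]

Since I had never read the .NET source, I could only answer from my own understanding, and not being sure of the answer was an unpleasant feeling. I had been wanting to dig deeper into .NET anyway, and the source code is available, so why not take a look?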

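Downloading the source

The .NET runtime source can be cloned with git clone https://github.com/dotnet/runtime.git. The source of the diagnostic tools, including but not limited to SOS, can be cloned with git clone https://github.com/dotnet/diagnostics.git.

Note: my original plan was to build the runtime and debug it. I spent quite a while on that and ran into several build errors, which I will not go into here; a separate article will cover them.

The checked-out version did not match my local runtime version (8.0.7), so I switched to the matching tag (git tag --list, then git checkout v8.0.7).

Where to start? The key message in the screenshot is the entry point, so naturally I began with a search.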

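Tracing the source of the OOM

Open FileLocator, an excellent tool for searching file contents, and search for Failed to reserve memory. There is a match in AnalyzeOOMCommand.cs.

[Image: search-keyword]

The key class is excerpted below: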

public class AnalyzeOOMCommand : ClrRuntimeCommandBase
{
    public override void Invoke()
    {
        bool foundOne = false;
        foreach (ClrOutOfMemoryInfo oom in Runtime.Heap.SubHeaps.Select(h => h.OomInfo).Where(oom => oom != null))
        {
            foundOne = true;

            Console.WriteLine(oom.Reason switch
            {
                OutOfMemoryReason.Budget or OutOfMemoryReason.CantReserve => "OOM was due to an internal .Net error, likely a bug in the GC",
                OutOfMemoryReason.CantCommit => "Didn't have enough memory to commit",
                OutOfMemoryReason.LOH => "Didn't have enough memory to allocate an LOH segment",
                OutOfMemoryReason.LowMem => "Low on memory during GC",
                OutOfMemoryReason.UnproductiveFullGC => "Could not do a full GC",
                _ => oom.Reason.ToString() // shouldn't happen, we handle all cases above
            });

            if (oom.GetMemoryFailure != GetMemoryFailureReason.None)
            {
                string message = oom.GetMemoryFailure switch
                {
                    GetMemoryFailureReason.ReserveSegment => "Failed to reserve memory",
                    GetMemoryFailureReason.CommitSegmentBegin => "Didn't have enough memory to commit beginning of the segment",
                    GetMemoryFailureReason.CommitEphemeralSegment => "Didn't have enough memory to commit the new ephemeral segment",
                    GetMemoryFailureReason.GrowTable => "Didn't have enough memory to grow the internal GC data structures",
                    GetMemoryFailureReason.CommitTable => "Didn't have enough memory to commit the internal GC data structures",
                    _ => oom.GetMemoryFailure.ToString() // shouldn't happen, we handle all cases above
                };

                Console.WriteLine($"Details: {(oom.IsLargeObjectHeap ? "LOH" : "SOH")} {message} {oom.Size:n0} bytes");

                // If it's a commit error (GetMemoryFailureReason.GrowTable can indicate a reserve
                // or a commit error since we make one VirtualAlloc call to reserve and commit),
                // we indicate the available commit space if we recorded it.
                if (oom.AvailablePageFileMB != 0)
                {
                    Console.WriteLine($" - on GC entry available commit space was {oom.AvailablePageFileMB:n0} MB");
                }
            }
        }

        if (!foundOne)
        {
            Console.WriteLine("There was no managed OOM due to allocations on the GC heap");
        }
    }
}

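The code closely matches the output in the screenshot, so this looks like the right place. From the logic, the AnalyzeOOMCommand command iterates over the sub-heaps and prints the ClrOutOfMemoryInfo of each one. The key line is:

foreach (ClrOutOfMemoryInfo oom in Runtime.Heap.SubHeaps.Select(h => h.OomInfo).Where(oom => oom != null))

From this we can infer that when an OOM occurs, the runtime saves the relevant information on the heap, and plugins such as SOS simply read it back from the heap and display it. After reading this class I felt confident again, as if I could write a similar plugin myself. Enough digression, back to the topic. Clicking h.OomInfo jumps to the implementation of OomInfo, which turns out to be a property of the ClrSubHeap class.

Note: ClrSubHeap is not implemented in the diagnostics project but in Microsoft.Diagnostics.Runtime, which can be downloaded with git clone https://github.com/microsoft/clrmd.git.

Open Microsoft.Diagnostics.Runtime.sln and search for OomInfo to find its implementation:

public ClrOutOfMemoryInfo? OomInfo => Heap.Helpers.GetOOMInfo(Address, out OomInfo oomInfo) ? new(oomInfo) : null;

OomInfo comes from the Heap.Helpers.GetOOMInfo() function. Heap is a member of ClrSubHeap and is passed in when the ClrSubHeap is constructed.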

public class ClrSubHeap : IClrSubHeap
{
    internal ClrSubHeap(ClrHeap clrHeap, in SubHeapInfo subHeap)
    {
        Heap = clrHeap;
        // ...
    }

    //...

    public ClrHeap Heap { get; }
    IClrHeap IClrSubHeap.Heap => Heap;

    //...
}

Jump to the implementation of ClrHeap to see where Helpers comes from:

public sealed class ClrHeap : IClrHeap
{
    //...

    internal ClrHeap(ClrRuntime runtime, IMemoryReader memoryReader, IAbstractHeapProvider helpers, IAbstractTypeProvider typeHelpers, in GCState gcInfo)
    {
        Runtime = runtime;
        _memoryReader = memoryReader;
        Helpers = helpers; //<----

        //...

        SubHeaps = Helpers.EnumerateSubHeaps().Select(r => new ClrSubHeap(this, r)).ToImmutableArray();
        Segments = SubHeaps.SelectMany(r => r.Segments).OrderBy(r => r.FirstObjectAddress).ToImmutableArray();
    }

    internal IAbstractHeap Helpers { get; } //<---

    public ClrRuntime Runtime { get; }
}

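Helpers is passed in as a parameter when ClrHeap is constructed, and SubHeaps is also created in the ClrHeap constructor.

Next, see how ClrHeap itself is constructed. Click the reference count shown above the ClrHeap constructor to jump to the call site. It turns out to be the Heap property of ClrRuntime.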

public sealed class ClrRuntime : IClrRuntime
{
    //...

    public ClrHeap Heap
    {
        get
        {
            ClrHeap? heap = _heap;
            while (heap is null) // Flush can cause a race.
            {
                IAbstractHeap? heapHelpers = GetService<IAbstractHeap>(); //<---
                IAbstractTypeHelpers? typeHelpers = GetService<IAbstractTypeHelpers>();

                // These are defined as non-nullable but just in case, double check we have a non-null instance.
                if (heapHelpers is null || typeHelpers is null)
                    throw new NotSupportedException("Unable to create a ClrHeap for this runtime.");

                heap = new(this, DataTarget.DataReader, heapHelpers, typeHelpers);
                Interlocked.CompareExchange(ref _heap, heap, null);
                heap = _heap;
            }

            return heap;
        }
    }
}

The heapHelpers argument comes from IAbstractHeap? heapHelpers = GetService<IAbstractHeap>();

Jump to the implementation of GetService():

internal T? GetService<T>() where T: class => (T?)_services.GetService(typeof(T));

GetService<T>() is implemented by calling _services.GetService(typeof(T)). Next, see where _services comes from.

public sealed class ClrRuntime : IClrRuntime
{
    private readonly IServiceProvider _services;
    private volatile ClrHeap? _heap;
    private ImmutableArray<ClrThread> _threads;
    private volatile DomainAndModules? _domainAndModules;

    private IAbstractRuntime? _runtime;
    private IAbstractComHelpers? _comHelpers;
    private IAbstractMethodLocator? _methodLocator;
    private IAbstractDacController? _controller;

    internal ClrRuntime(ClrInfo clrInfo, IServiceProvider services)
    {
        ClrInfo = clrInfo;
        DataTarget = clrInfo.DataTarget;
        _services = services; //<---
    }
    //...
}

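_services is passed in when ClrRuntime is constructed. Clicking the reference count above the ClrRuntime constructor shows that ClrRuntime is created in the CreateRuntimeWorker() function of ClrInfo, and the services argument comes from ClrInfoProvider.GetDacServices().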

public sealed class ClrInfo : IClrInfo
{
    private ClrRuntime CreateRuntimeWorker(string? dacPath, bool ignoreMismatch, bool verifySignature)
    {
        IServiceProvider services = ClrInfoProvider.GetDacServices(this, dacPath, ignoreMismatch, verifySignature);
        return new ClrRuntime(this, services);
    }
}

Next, see where ClrInfoProvider comes from. It is a property of ClrInfo that is initialized when ClrInfo is constructed.

public sealed class ClrInfo : IClrInfo
{
    internal ClrInfo(DataTarget dt, ModuleInfo module, Version clrVersion, IClrInfoProvider provider)
    {
        DataTarget = dt ?? throw new ArgumentNullException(nameof(dt));
        ModuleInfo = module ?? throw new ArgumentNullException(nameof(module));
        ClrInfoProvider = provider ?? throw new ArgumentNullException(nameof(provider)); //<---
        Version = clrVersion ?? throw new ArgumentNullException(nameof(clrVersion));
    }

    /// <summary>
    /// The DataTarget containing this ClrInfo.
    /// </summary>
    public DataTarget DataTarget { get; }

    /// <summary>
    /// The IClrInfoProvider which created this ClrInfo.
    /// </summary>
    internal IClrInfoProvider ClrInfoProvider { get; } //<---

    IDataTarget IClrInfo.DataTarget => DataTarget;

    //...
}

Now see where ClrInfo is created. Clicking the reference count above the ClrInfo constructor shows that it comes from the CreateClrInfo() function of DotNetClrInfoProvider.

internal class DotNetClrInfoProvider : IClrInfoProvider
{
    //...

    protected ClrInfo CreateClrInfo(DataTarget dataTarget, ModuleInfo module, ulong runtimeInfo, ClrFlavor flavor)
    {
        //...

        ClrInfo result = new(dataTarget, module, version, this) //<---
        {
            Flavor = flavor,
            DebuggingLibraries = orderedDebugLibraries.ToImmutableArray(),
            ContractDescriptorAddress = contractDescriptor,
            IndexFileSize = indexFileSize,
            IndexTimeStamp = indexTimeStamp,
            BuildId = buildId,
        };

        return result;
    }
}

As the code shows, the last argument of the ClrInfo constructor is this, so the ClrInfoProvider in ClrInfo is an object of type DotNetClrInfoProvider. Now look at the DotNetClrInfoProvider::GetDacServices() function.

public IServiceProvider GetDacServices(ClrInfo clrInfo, string? providedPath, bool ignoreMismatch, bool verifySignature)
{
    DacLibrary library = GetDacLibraryFromPath(clrInfo, providedPath, ignoreMismatch, verifySignature);
    return new DacServiceProvider(clrInfo, library);
}

It returns an object of type DacServiceProvider, so ClrRuntime._services is actually a DacServiceProvider. The call IAbstractHeap? heapHelpers = GetService<IAbstractHeap>() in the Heap property of ClrRuntime is therefore equivalent to calling DacServiceProvider.GetService(typeof(IAbstractHeap)).

The implementation of DacServiceProvider.GetService(Type) is as follows:

internal class DacServiceProvider : IServiceProvider, IDisposable, IAbstractDacController
{
    //...

    public object? GetService(Type serviceType)
    {
        if (serviceType == typeof(IAbstractRuntime))
            return _runtime ??= new DacRuntime(_clrInfo, _process, _sos, _sos13);

        if (serviceType == typeof(IAbstractHeap)) //<---
        {
            IAbstractHeap? heap = _heapHelper;
            if (heap is not null)
                return heap;

            if (_sos.GetGCHeapData(out GCInfo data) && _sos.GetCommonMethodTables(out CommonMethodTables mts) && mts.ObjectMethodTable != 0)
                return _heapHelper = new DacHeap(_sos, _sos8, _sos12, _sos16, _dataReader, data, mts);

            return null;
        }

        // ...
    }

    // ...
}
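
The lookup pattern in GetService(Type) above (dispatch on the requested type, create the service on first use, return the cached instance afterwards) can be sketched in C++. This is only an illustration of the pattern, not ClrMD code: ServiceProvider, IHeapService and DacHeapLike are hypothetical names invented for the sketch.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

// Hypothetical stand-ins for IAbstractHeap / DacHeap; not the ClrMD types.
struct IHeapService
{
    virtual ~IHeapService() = default;
    virtual const char* Name() const = 0;
};

struct DacHeapLike : IHeapService
{
    const char* Name() const override { return "DacHeapLike"; }
};

// A type-keyed service provider that creates each service lazily and caches it.
class ServiceProvider
{
public:
    // Register a factory for service type T. Nothing is created yet.
    template <typename T>
    void Register(std::function<std::shared_ptr<T>()> factory)
    {
        assert(factory); // an empty factory is a programming error
        factories_[std::type_index(typeid(T))] =
            [factory]() -> std::shared_ptr<void> { return factory(); };
    }

    // Return the cached instance, or create it on first request.
    // Returns nullptr when no factory is registered or creation fails.
    template <typename T>
    std::shared_ptr<T> GetService()
    {
        const std::type_index key(typeid(T));

        auto cached = cache_.find(key);
        if (cached != cache_.end())
            return std::static_pointer_cast<T>(cached->second);

        auto factory = factories_.find(key);
        if (factory == factories_.end())
            return nullptr;

        std::shared_ptr<void> created = factory->second();
        if (created)
            cache_[key] = created; // only successful creations are cached
        return std::static_pointer_cast<T>(created);
    }

private:
    std::unordered_map<std::type_index, std::function<std::shared_ptr<void>()>> factories_;
    std::unordered_map<std::type_index, std::shared_ptr<void>> cache_;
};
```

Registering a factory for IHeapService and calling GetService<IHeapService>() twice returns the same instance and runs the factory once, which mirrors how _heapHelper is reused in the C# code. Like the C# version, a failed creation is not cached, so a later call tries again.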

So the Helpers member of ClrHeap is of type DacHeap. Here is its GetOOMInfo() implementation.

internal sealed class DacHeap : IAbstractHeap
{
    public bool GetOOMInfo(ulong subHeapAddress, out OomInfo oomInfo)
    {
        DacOOMData oomData;
        if (subHeapAddress != 0)
        {
            if (!_sos.GetOOMData(subHeapAddress, out oomData) || oomData.Reason == OutOfMemoryReason.None && oomData.GetMemoryFailure == GetMemoryFailureReason.None)
            {
                oomInfo = default;
                return false;
            }
        }
        else
        {
            if (!_sos.GetOOMData(out oomData) || oomData.Reason == OutOfMemoryReason.None && oomData.GetMemoryFailure == GetMemoryFailureReason.None)
            {
                oomInfo = default;
                return false;
            }
        }

        oomInfo = new()
        {
            AllocSize = oomData.AllocSize,
            AvailablePageFileMB = oomData.AvailablePageFileMB,
            GCIndex = oomData.GCIndex,
            GetMemoryFailure = oomData.GetMemoryFailure,
            IsLOH = oomData.IsLOH != 0,
            Reason = oomData.Reason,
            Size = oomData.Size,
        };
        return true;
    }
}

It calls _sos.GetOOMData(out oomData). _sos is a member variable of DacHeap that comes from the first parameter of the DacHeap constructor.

internal sealed class DacHeap : IAbstractHeap
{
    private readonly SOSDac _sos; //<--
    private readonly SOSDac8? _sos8;

    //...

    public DacHeap(SOSDac sos, SOSDac8? sos8, SosDac12? sos12, ISOSDac16? sos16, IMemoryReader reader, in GCInfo gcInfo, in CommonMethodTables commonMethodTables)
    {
        _sos = sos; //<--
        _sos8 = sos8;

        //...
    }
}

DacHeap in turn is created in DacServiceProvider.GetService(Type). The key line is

return _heapHelper = new DacHeap(_sos, _sos8, _sos12, _sos16, _dataReader, data, mts);

The first argument passed to DacHeap is _sos, a member variable of DacServiceProvider. It is initialized in the DacServiceProvider constructor, shown below:

internal class DacServiceProvider : IServiceProvider, IDisposable, IAbstractDacController
{
    private readonly ClrInfo _clrInfo;
    private readonly IDataReader _dataReader;

    private readonly DacLibrary _dac;
    private readonly ClrDataProcess _process;
    private readonly SOSDac _sos;
    private readonly SOSDac6? _sos6;

    //...

    public DacServiceProvider(ClrInfo clrInfo, DacLibrary library)
    {
        _clrInfo = clrInfo;
        _dataReader = _clrInfo.DataTarget.DataReader;

        _dac = library;
        _process = library.CreateClrDataProcess();
        _sos = _process.CreateSOSDacInterface() ?? throw new InvalidOperationException($"Could not create ISOSDacInterface."); //<--
        _sos6 = _process.CreateSOSDacInterface6();

        library.DacDataTarget.SetMagicCallback(_process.Flush);
        IsThreadSafe = _sos13 is not null || RuntimeInformation.IsOSPlatform(OSPlatform.Windows);
    }

    // ...
}

_sos is created by _process.CreateSOSDacInterface(), and _process is of type ClrDataProcess. The implementation of _process.CreateSOSDacInterface() is as follows:

internal sealed unsafe class ClrDataProcess : CallableCOMWrapper
{
    private static readonly Guid IID_IXCLRDataProcess = new("5c552ab6-fc09-4cb3-8e36-22fa03c798b7");
    private readonly DacLibrary _library;

    //...

    public SOSDac? CreateSOSDacInterface()
    {
        IntPtr result = QueryInterface(SOSDac.IID_ISOSDac);
        if (result == IntPtr.Zero)
            return null;

        try
        {
            return new SOSDac(_library, result);
        }
        catch (InvalidOperationException)
        {
            return null;
        }
    }

    //...
}

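This function returns an object of type SOSDac. The second argument of its constructor is obtained from QueryInterface(SOSDac.IID_ISOSDac). SOSDac.IID_ISOSDac has the value 436f00f2-b42a-4b9f-870c-e73db66ae930 and is a static field of the SOSDac class, which is defined as follows: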

internal sealed unsafe class SOSDac : CallableCOMWrapper
{
    internal static readonly Guid IID_ISOSDac = new("436f00f2-b42a-4b9f-870c-e73db66ae930");

    private readonly DacLibrary _library;
    private volatile Dictionary<int, string>? _regNames;
    private volatile Dictionary<ulong, string>? _frameNames;

    public SOSDac(DacLibrary? library, IntPtr ptr)
        : base(library?.OwningLibrary, IID_ISOSDac, ptr)
    {
        _library = library ?? throw new ArgumentNullException(nameof(library));
    }

    private ref readonly ISOSDacVTable VTable => ref Unsafe.AsRef<ISOSDacVTable>(_vtable);

    public SOSDac(DacLibrary lib, CallableCOMWrapper toClone) : base(toClone)
    {
        _library = lib;
    }

    //...

    public HResult GetOOMData(out DacOOMData oomData) => VTable.GetOOMStaticData(Self, out oomData);

    public HResult GetOOMData(ulong address, out DacOOMData oomData) => VTable.GetOOMData(Self, address, out oomData);
}

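This class does no real work itself; everything is forwarded to the implementation in VTable. Its base class is CallableCOMWrapper, so it is a fair guess that this is a COM call wrapper and that the real implementation lives in the native layer. Is that so? A search in the native code will tell.

Looking at the CLR runtime implementation

Searching the native code for 436f00f2-b42a-4b9f-870c-e73db66ae930 finds it in the sospriv.h header.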

MIDL_INTERFACE("436f00f2-b42a-4b9f-870c-e73db66ae930") //<---
ISOSDacInterface : public IUnknown
{
public:
    //...

    virtual HRESULT STDMETHODCALLTYPE GetOOMData(
        CLRDATA_ADDRESS oomAddr,
        struct DacpOomData *data) = 0;

    virtual HRESULT STDMETHODCALLTYPE GetOOMStaticData(
        struct DacpOomData *data) = 0;

    //...
};

Searching further for ISOSDacInterface finds where it is used in daccess.cpp. The implementing class is ClrDataAccess.

Note: the corresponding declaration is in D:\dotnet\runtime\src\coreclr\debug\daccess\dacimpl.h

//D:\dotnet\runtime\src\coreclr\debug\daccess\daccess.cpp

STDMETHODIMP
ClrDataAccess::QueryInterface(THIS_
                              IN REFIID interfaceId,
                              OUT PVOID* iface)
{
    void* ifaceRet;

    if (IsEqualIID(interfaceId, IID_IUnknown) ||
        IsEqualIID(interfaceId, __uuidof(IXCLRDataProcess)) ||
        IsEqualIID(interfaceId, __uuidof(IXCLRDataProcess2)))
    {
        ifaceRet = static_cast<IXCLRDataProcess2*>(this);
    }
    else if (IsEqualIID(interfaceId, __uuidof(ICLRDataEnumMemoryRegions)))
    {
        ifaceRet = static_cast<ICLRDataEnumMemoryRegions*>(this);
    }
    else if (IsEqualIID(interfaceId, __uuidof(ISOSDacInterface))) //<---
    {
        ifaceRet = static_cast<ISOSDacInterface*>(this);
    }
    else if (IsEqualIID(interfaceId, __uuidof(ISOSDacInterface2)))
    {
        ifaceRet = static_cast<ISOSDacInterface2*>(this);
    }
    //...

    AddRef();
    *iface = ifaceRet;
    return S_OK;
}
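
To make the QueryInterface pattern above easier to see in isolation, here is a minimal sketch that does not depend on the Windows COM headers. It assumes a simplified 16-byte InterfaceId in place of a real GUID, and IBase, IOomReader and DataAccess are hypothetical names that only mimic the roles of IUnknown, ISOSDacInterface and ClrDataAccess. The value returned by ReadOomReason is a placeholder.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte interface id; real COM uses GUID/IID.
struct InterfaceId
{
    std::uint8_t bytes[16];
};

inline bool SameId(const InterfaceId& a, const InterfaceId& b)
{
    return std::memcmp(a.bytes, b.bytes, sizeof(a.bytes)) == 0;
}

constexpr InterfaceId kIdBase = {{0x01}};
constexpr InterfaceId kIdOomReader = {{0x02}};

// Plays the role of IUnknown.
struct IBase
{
    virtual bool QueryInterface(const InterfaceId& id, void** out) = 0;
    virtual unsigned AddRef() = 0;
    virtual unsigned Release() = 0;

protected:
    ~IBase() = default; // lifetime is managed through the reference count
};

// Plays the role of ISOSDacInterface.
struct IOomReader : IBase
{
    virtual int ReadOomReason() = 0;
};

// Plays the role of ClrDataAccess: one object, several interface views.
class DataAccess final : public IOomReader
{
public:
    bool QueryInterface(const InterfaceId& id, void** out) override
    {
        assert(out != nullptr);
        if (SameId(id, kIdBase))
            *out = static_cast<IBase*>(this);
        else if (SameId(id, kIdOomReader))
            *out = static_cast<IOomReader*>(this);
        else
        {
            *out = nullptr;
            return false; // a real COM object would return E_NOINTERFACE
        }
        AddRef(); // the caller now owns one reference
        return true;
    }

    unsigned AddRef() override { return ++refs_; }
    unsigned Release() override { assert(refs_ > 0); return --refs_; }
    int ReadOomReason() override { return 3; } // placeholder value
    unsigned RefCount() const { return refs_; }

private:
    unsigned refs_ = 1;
};
```

A caller passes the id it wants, receives a pointer to that interface's vtable view of the same object, and calls methods through it. This is essentially what the managed SOSDac wrapper does when it forwards GetOOMData to its VTable.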

The implementation of ClrDataAccess::GetOOMData() is as follows:

//D:\dotnet\runtime\src\coreclr\debug\daccess\request.cpp

HRESULT
ClrDataAccess::GetOOMData(CLRDATA_ADDRESS oomAddr, struct DacpOomData *data)
{
    if (oomAddr == 0 || data == NULL)
        return E_INVALIDARG;

    SOSDacEnter();
    *data = {};

    if (!GCHeapUtilities::IsServerHeap())
        hr = E_FAIL; // doesn't make sense to call this on WKS mode

#ifdef FEATURE_SVR_GC
    else
        hr = ServerOomData(oomAddr, data);
#else
    _ASSERTE_MSG(false, "IsServerHeap returned true but FEATURE_SVR_GC not defined");
    hr = E_NOTIMPL;
#endif //FEATURE_SVR_GC

    SOSDacLeave();
    return hr;
}

ClrDataAccess::ServerOomData() is implemented as follows:

//D:\dotnet\runtime\src\coreclr\debug\daccess\request_svr.cpp

HRESULT
ClrDataAccess::ServerOomData(CLRDATA_ADDRESS addr, DacpOomData *oomData)
{
    TADDR heapAddress = TO_TADDR(addr);
    dac_gc_heap heap = LoadGcHeapData(heapAddress);
    dac_gc_heap* pHeap = &heap;

    oom_history pOOMInfo = pHeap->oom_info;
    oomData->reason = pOOMInfo.reason;
    oomData->alloc_size = pOOMInfo.alloc_size;
    oomData->available_pagefile_mb = pOOMInfo.available_pagefile_mb;
    oomData->gc_index = pOOMInfo.gc_index;
    oomData->fgm = pOOMInfo.fgm;
    oomData->size = pOOMInfo.size;
    oomData->loh_p = pOOMInfo.loh_p;

    return S_OK;
}

As the code shows, oomData comes from pHeap->oom_info. The definition of oom_info is:

//D:\dotnet\runtime\src\coreclr\gc\gcpriv.h

class gc_heap
{
    //...
    PER_HEAP_FIELD_DIAG_ONLY oom_history oom_info;
    //...
}

Its type is oom_history, defined as follows:

//D:\dotnet\runtime\src\coreclr\gc\gcinterface.dac.h

// Reasons why an OOM might occur, recorded in the oom_history
// struct below.
enum oom_reason
{
    oom_no_failure = 0,
    oom_budget = 1,
    oom_cant_commit = 2,
    oom_cant_reserve = 3,
    oom_loh = 4,
    oom_low_mem = 5,
    oom_unproductive_full_gc = 6
};

/*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
/* If you modify failure_get_memory and         */
/* oom_reason be sure to make the corresponding */
/* changes in ClrMD.                            */
/*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
enum failure_get_memory
{
    fgm_no_failure = 0,
    fgm_reserve_segment = 1,
    fgm_commit_segment_beg = 2,
    fgm_commit_eph_segment = 3,
    fgm_grow_table = 4,
    fgm_commit_table = 5
};

// A record of the last OOM that occurred in the GC, with some
// additional information as to what triggered the OOM.
struct oom_history
{
    oom_reason reason;
    size_t alloc_size;
    uint8_t* reserved;
    uint8_t* allocated;
    size_t gc_index;
    failure_get_memory fgm;
    size_t size;
    size_t available_pagefile_mb;
    BOOL loh_p;
};
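
To connect these native definitions back to the text printed by AnalyzeOOMCommand, the following sketch decodes a failure_get_memory value into the same messages. The enum values are copied from the excerpt above and the strings from the AnalyzeOOMCommand excerpt; describe_fgm and format_details are hypothetical helper names, and unlike the C# code the size is printed without thousands separators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Values copied from the gcinterface.dac.h excerpt.
enum failure_get_memory
{
    fgm_no_failure = 0,
    fgm_reserve_segment = 1,
    fgm_commit_segment_beg = 2,
    fgm_commit_eph_segment = 3,
    fgm_grow_table = 4,
    fgm_commit_table = 5
};

// Hypothetical helper: maps each value to the message AnalyzeOOMCommand prints.
inline const char* describe_fgm(failure_get_memory fgm)
{
    switch (fgm)
    {
        case fgm_no_failure:         return "None";
        case fgm_reserve_segment:    return "Failed to reserve memory";
        case fgm_commit_segment_beg: return "Didn't have enough memory to commit beginning of the segment";
        case fgm_commit_eph_segment: return "Didn't have enough memory to commit the new ephemeral segment";
        case fgm_grow_table:         return "Didn't have enough memory to grow the internal GC data structures";
        case fgm_commit_table:       return "Didn't have enough memory to commit the internal GC data structures";
    }
    return "Unknown";
}

// Hypothetical helper: builds the text that follows "Details: " from the
// fgm, size and loh_p fields of an oom_history record.
inline std::string format_details(failure_get_memory fgm, std::size_t size, bool loh_p)
{
    assert(fgm != fgm_no_failure); // the command only prints details for a failure
    return std::string(loh_p ? "LOH" : "SOH") + " " + describe_fgm(fgm) + " " +
           std::to_string(size) + " bytes";
}
```

For a failed 16MB reservation on the small object heap, format_details(fgm_reserve_segment, 16777216, false) yields "SOH Failed to reserve memory 16777216 bytes", which has the same shape as the line in the screenshot.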

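These definitions look very familiar. From what we know so far, this struct should be filled in by the runtime when an OOM occurs. Searching the code for uses of fgm_reserve_segment gives only two hits: one is its definition and the other is where it is used. The code that uses it is: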

heap_segment*
gc_heap::get_segment (size_t size, gc_oh_num oh)
{
    assert(oh != gc_oh_num::unknown);
    BOOL uoh_p = (oh == gc_oh_num::loh) || (oh == gc_oh_num::poh);
    if (heap_hard_limit)
        return NULL;

    heap_segment* result = 0;

    if (segment_standby_list != 0)
    {
        result = segment_standby_list;
        heap_segment* last = 0;
        while (result)
        {
            size_t hs = (size_t)(heap_segment_reserved (result) - (uint8_t*)result);
            if ((hs >= size) && ((hs / 2) < size))
            {
                dprintf (2, ("Hoarded segment %zx found", (size_t) result));
                if (last)
                {
                    heap_segment_next (last) = heap_segment_next (result);
                }
                else
                {
                    segment_standby_list = heap_segment_next (result);
                }
                break;
            }
            else
            {
                last = result;
                result = heap_segment_next (result);
            }
        }
    }

    if (!result)
    {
        void* mem = virtual_alloc (size);
        if (!mem)
        {
            fgm_result.set_fgm (fgm_reserve_segment, size, uoh_p); //<---
            return 0;
        }

        //...
    }

    //...
}
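
Before falling back to virtual_alloc, the loop above tries to reuse a hoarded segment from segment_standby_list, and it only accepts one whose reserved size hs satisfies hs >= size and hs / 2 < size, that is, large enough but less than twice the request. The sketch below isolates that condition. It is a simplification that assumes the list holds plain sizes in a std::list; fits_request and take_from_standby are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// The acceptance test from get_segment: the hoarded segment must be large
// enough for the request, but less than twice as large.
inline bool fits_request(std::size_t hs, std::size_t size)
{
    return (hs >= size) && ((hs / 2) < size);
}

// Hypothetical helper: removes and returns the first fitting segment size
// from the standby list, or returns 0 when nothing fits (the case in which
// get_segment goes on to call virtual_alloc).
inline std::size_t take_from_standby(std::list<std::size_t>& standby, std::size_t size)
{
    assert(size > 0);
    for (auto it = standby.begin(); it != standby.end(); ++it)
    {
        if (fits_request(*it, size))
        {
            const std::size_t hs = *it;
            standby.erase(it); // unlink the segment, as the original does
            return hs;
        }
    }
    return 0;
}
```

With a standby list of 64MB, 16MB and 24MB, a 16MB request skips the 64MB entry (more than twice the request) and takes the 16MB one. A later 10MB request against a list that only holds 64MB finds nothing, so a fresh reservation would be needed.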

When virtual_alloc (size) returns null, fgm_reserve_segment is set. The implementation of virtual_alloc is as follows:


void* virtual_alloc (size_t size)
{
    return virtual_alloc(size, false);
}

void* virtual_alloc (size_t size, bool use_large_pages_p, uint16_t numa_node)
{
    size_t requested_size = size;

    if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
    {
        gc_heap::reserved_memory_limit =
            GCScan::AskForMoreReservedMemory (gc_heap::reserved_memory_limit, requested_size);
        if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
        {
            return 0; //<---
        }
    }

    uint32_t flags = VirtualReserveFlags::None;
#ifndef FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP
    if (virtual_alloc_hardware_write_watch)
    {
        flags = VirtualReserveFlags::WriteWatch;
    }
#endif // !FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP

    //<--- use_large_pages_p is false, so GCToOSInterface::VirtualReserve is called
    void* prgmem = use_large_pages_p ?
        GCToOSInterface::VirtualReserveAndCommitLargePages(requested_size, numa_node) :
        GCToOSInterface::VirtualReserve(requested_size, card_size * card_word_width, flags, numa_node);
    void *aligned_mem = prgmem;

    // We don't want (prgmem + size) to be right at the end of the address space
    // because we'd have to worry about that everytime we do (address + size).
    // We also want to make sure that we leave loh_size_threshold at the end
    // so we allocate a small object we don't need to worry about overflow there
    // when we do alloc_ptr+size.
    if (prgmem)
    {
        uint8_t* end_mem = (uint8_t*)prgmem + requested_size;

        if ((end_mem == 0) || ((size_t)(MAX_PTR - end_mem) <= END_SPACE_AFTER_GC))
        {
            GCToOSInterface::VirtualRelease (prgmem, requested_size);
            dprintf (2, ("Virtual Alloc size %zd returned memory right against 4GB [%zx, %zx[ - discarding",
                        requested_size, (size_t)prgmem, (size_t)((uint8_t*)prgmem+requested_size)));
            prgmem = 0;
            aligned_mem = 0;
        }
    }

    if (prgmem)
    {
        gc_heap::reserved_memory += requested_size;
    }

    dprintf (2, ("Virtual Alloc size %zd: [%zx, %zx[",
                requested_size, (size_t)prgmem, (size_t)((uint8_t*)prgmem+requested_size)));

    return aligned_mem;
}

In the code above there are three places that can cause a null return. The first is:

if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
{
    gc_heap::reserved_memory_limit =
        GCScan::AskForMoreReservedMemory (gc_heap::reserved_memory_limit, requested_size);
    if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
    {
        return 0; //<---
    }
}

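The rough logic: if the reserved-memory limit (gc_heap::reserved_memory_limit) minus the memory already reserved (gc_heap::reserved_memory) is less than the requested number of bytes (requested_size), GCScan::AskForMoreReservedMemory() is called to ask for more reserved memory, and it returns the new limit. If the new limit minus the memory already reserved is still less than the requested number of bytes, null is returned.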
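
The budget check described above can be exercised on its own with made-up numbers. In the sketch below, ReserveBudget and budget_allows are hypothetical names, and the ask_for_more callback is only a stand-in for GCScan::AskForMoreReservedMemory, whose real behavior is not shown in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Stand-ins for gc_heap::reserved_memory_limit and gc_heap::reserved_memory.
struct ReserveBudget
{
    std::size_t limit;
    std::size_t reserved;
};

// Returns true when the request fits within the (possibly raised) limit.
// ask_for_more receives the current limit and the request size and returns
// the new limit; it is a hypothetical stand-in for AskForMoreReservedMemory.
inline bool budget_allows(ReserveBudget& budget, std::size_t requested_size,
                          const std::function<std::size_t(std::size_t, std::size_t)>& ask_for_more)
{
    assert(budget.reserved <= budget.limit); // keeps the unsigned subtraction meaningful
    if ((budget.limit - budget.reserved) < requested_size)
    {
        budget.limit = ask_for_more(budget.limit, requested_size);
        if ((budget.limit - budget.reserved) < requested_size)
        {
            return false; // corresponds to the first "return 0" in virtual_alloc
        }
    }
    return true;
}
```

With a 256MB limit of which 250MB is already reserved, a 16MB request is rejected if the callback refuses to raise the limit, while a 4MB request passes without asking. As in the original, reserved is not increased here; virtual_alloc only adds to it after the OS reservation succeeds.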

The second is:

void* prgmem = use_large_pages_p ?
    GCToOSInterface::VirtualReserveAndCommitLargePages(requested_size, numa_node) :
    GCToOSInterface::VirtualReserve(requested_size, card_size * card_word_width, flags, numa_node);
void *aligned_mem = prgmem;

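Because use_large_pages_p is false, GCToOSInterface::VirtualReserve() is called, which on Windows in turn calls VirtualAlloc(). If that call fails, prgmem is null and virtual_alloc returns null.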
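
For readers who have not used the OS call directly, the sketch below shows what "reserve without commit" looks like. It is not the runtime's code: reserve_address_space and release_address_space are hypothetical helpers, the Windows branch uses VirtualAlloc with MEM_RESERVE (the PAGE_READWRITE protection is an assumption of this sketch), and the other branch uses the common POSIX analogue, an anonymous PROT_NONE mapping.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_WIN32)
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Reserve a contiguous range of address space without committing memory.
// Returns nullptr on failure.
inline void* reserve_address_space(std::size_t size)
{
    assert(size > 0);
#if defined(_WIN32)
    // MEM_RESERVE alone claims address space; no physical memory or
    // page file space is charged until the pages are committed.
    return ::VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_READWRITE);
#else
    // An inaccessible anonymous mapping plays the same role on POSIX systems.
    void* p = ::mmap(nullptr, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
#endif
}

// Give the reserved range back to the OS. Returns true on success.
inline bool release_address_space(void* p, std::size_t size)
{
#if defined(_WIN32)
    (void)size; // MEM_RELEASE requires a size of 0
    return ::VirtualFree(p, 0, MEM_RELEASE) != 0;
#else
    return ::munmap(p, size) == 0;
#endif
}
```

A reservation of this kind needs a single contiguous free range of the full requested size in the process's address space, so whether it succeeds depends on the address space layout rather than on how much physical memory is free.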

The third is:

if (prgmem)
{
    uint8_t* end_mem = (uint8_t*)prgmem + requested_size;

    if ((end_mem == 0) || ((size_t)(MAX_PTR - end_mem) <= END_SPACE_AFTER_GC))
    {
        GCToOSInterface::VirtualRelease (prgmem, requested_size);
        dprintf (2, ("Virtual Alloc size %zd returned memory right against 4GB [%zx, %zx[ - discarding",
                    requested_size, (size_t)prgmem, (size_t)((uint8_t*)prgmem+requested_size)));
        prgmem = 0;
        aligned_mem = 0;
    }
}

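MAX_PTR is the largest possible address (the largest unsigned integer, treated as a pointer), and end_mem is the end of the block that was just reserved. If the block ends exactly at the top of the address space (end_mem wraps around to 0), or if no more than END_SPACE_AFTER_GC bytes remain after end_mem, the reservation is released and null is returned. According to the comment in the code, the point is to leave loh_size_threshold free at the end so that alloc_ptr + size cannot overflow.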
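
The end-of-address-space test is easier to follow with small numbers, so the sketch below replays it in a simulated 32-bit address space. The names addr_t, kMaxPtr, kEndSpaceAfterGc and too_close_to_end are hypothetical, and the value 85000 for the guard space is only an illustrative assumption, not the runtime's actual END_SPACE_AFTER_GC.

```cpp
#include <cassert>
#include <cstdint>

// Simulated 32-bit addresses; unsigned arithmetic wraps around like pointers would.
using addr_t = std::uint32_t;

constexpr addr_t kMaxPtr = UINT32_MAX;    // stands in for MAX_PTR
constexpr addr_t kEndSpaceAfterGc = 85000; // illustrative guard size (assumption)

// Mirrors the condition in virtual_alloc: reject a block that ends exactly
// at the top of the address space or too close to it.
inline bool too_close_to_end(addr_t base, addr_t size)
{
    assert(size > 0);
    const addr_t end_mem = base + size; // wraps to 0 if the block ends at the very top
    return (end_mem == 0) || ((kMaxPtr - end_mem) <= kEndSpaceAfterGc);
}
```

A 16MB block at 0x10000000 is accepted, a block that ends exactly at the 4GB boundary is rejected because end_mem wraps to 0, and a block that ends at 0xFFFF0000 is rejected because only 65535 bytes remain above it.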

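That wraps up this round of digging. The error in the first screenshot was most likely caused by virtual_alloc failing. Why virtual_alloc failed, I still have not figured out. It only tries to reserve address space and does not commit anything, so in principle it should not fail while a large enough free block exists. Under what conditions does VirtualAlloc() fail? Any pointers from the experts would be much appreciated!

Summary

  • Once again, I strongly recommend FileLocator for searching file contents; it is well worth having.

References

  • .NET source code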