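Deep into the .NET Runtime: Analyzing an OOM Exception and Tracing It Through the Source

Abstract

This article records a source-level investigation of an out-of-memory (OOM) exception in a .NET application. It started with a seemingly contradictory observation: the diagnostic tools showed a sufficiently large free block (about 50 MB), yet the garbage collector (GC) failed while trying to reserve a smaller memory segment (about 16 MB).

To get to the root of it, I traced backwards along the data path behind the output of diagnostic tools such as SOS: starting from the high-level diagnostic command (AnalyzeOOMCommand), going down through the Microsoft.Diagnostics.Runtime (CLRMD) library, the managed helper classes, and the DAC interfaces, and finally reaching the GC-related native code at the bottom of the .NET runtime (CoreCLR), such as ClrDataAccess and gc_heap.

Although I did not manage to pin down the real cause in the end, I did work out the specific logic by which the GC's virtual_alloc can fail to reserve memory, through its reserved-memory limit check and its address-space layout considerations, and I got to know the .NET source a bit better. That seemed well worth writing down and sharing.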

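How It Started

A while ago, someone in a .NET debugging chat group had a .NET program hit an OOM (Out Of Memory) exception. He posted a few screenshots in the group and asked what could have caused it. One of the analysis results looked like this: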

[Screenshot: oom]

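It looks like the process ran out of memory during a GC, most likely failing when it tried to allocate roughly 16 MB. But another screenshot shows that the largest free block was still 50 MB: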

[Screenshot: address-summary]

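Since I had never read the .NET source, I could only answer from my own understanding, and that feeling of not being sure was unpleasant. I had been wanting to dig into .NET more anyway, and the source code is available, so why not take a look?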

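Downloading the Source

The .NET runtime source can be cloned with git clone https://github.com/dotnet/runtime.git. The source of the corresponding diagnostic tools, including but not limited to SOS, can be cloned with git clone https://github.com/dotnet/diagnostics.git.

Note: My original plan was to build a copy and debug it. I spent quite a while on that and ran into several build errors. I won't go into them here and will write them up in a separate article.

The checked-out version did not match my local runtime version (8.0.7), so I switched to the matching tag (git tag --list, then git checkout v8.0.7).

Where to start? The key message in the screenshot is the entry point, so naturally I started with a search.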

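Tracing the Source of the OOM Message

I opened FileLocator, a great file-content search tool, searched for Failed to reserve memory, and found a match in AnalyzeOOMCommand.cs.

The key class is excerpted below: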

public class AnalyzeOOMCommand : ClrRuntimeCommandBase
{
    public override void Invoke()
    {
        bool foundOne = false;
        foreach (ClrOutOfMemoryInfo oom in Runtime.Heap.SubHeaps.Select(h => h.OomInfo).Where(oom => oom != null))
        {
            foundOne = true;

            Console.WriteLine(oom.Reason switch
            {
                OutOfMemoryReason.Budget or OutOfMemoryReason.CantReserve => "OOM was due to an internal .Net error, likely a bug in the GC",
                OutOfMemoryReason.CantCommit => "Didn't have enough memory to commit",
                OutOfMemoryReason.LOH => "Didn't have enough memory to allocate an LOH segment",
                OutOfMemoryReason.LowMem => "Low on memory during GC",
                OutOfMemoryReason.UnproductiveFullGC => "Could not do a full GC",
                _ => oom.Reason.ToString() // shouldn't happen, we handle all cases above
            });

            if (oom.GetMemoryFailure != GetMemoryFailureReason.None)
            {
                string message = oom.GetMemoryFailure switch
                {
                    GetMemoryFailureReason.ReserveSegment => "Failed to reserve memory",
                    GetMemoryFailureReason.CommitSegmentBegin => "Didn't have enough memory to commit beginning of the segment",
                    GetMemoryFailureReason.CommitEphemeralSegment => "Didn't have enough memory to commit the new ephemeral segment",
                    GetMemoryFailureReason.GrowTable => "Didn't have enough memory to grow the internal GC data structures",
                    GetMemoryFailureReason.CommitTable => "Didn't have enough memory to commit the internal GC data structures",
                    _ => oom.GetMemoryFailure.ToString() // shouldn't happen, we handle all cases above
                };

                Console.WriteLine($"Details: {(oom.IsLargeObjectHeap ? "LOH" : "SOH")} {message} {oom.Size:n0} bytes");

                // If it's a commit error (GetMemoryFailureReason.GrowTable can indicate a reserve
                // or a commit error since we make one VirtualAlloc call to reserve and commit),
                // we indicate the available commit space if we recorded it.
                if (oom.AvailablePageFileMB != 0)
                {
                    Console.WriteLine($" - on GC entry available commit space was {oom.AvailablePageFileMB:n0} MB");
                }
            }
        }

        if (!foundOne)
        {
            Console.WriteLine("There was no managed OOM due to allocations on the GC heap");
        }
    }
}

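The code closely matches the output in the screenshot, so this looks like the right place. From its logic, AnalyzeOOMCommand iterates over the sub-heaps and prints the ClrOutOfMemoryInfo of each one. The key line is: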

foreach (ClrOutOfMemoryInfo oom in Runtime.Heap.SubHeaps.Select(h => h.OomInfo).Where(oom => oom != null))

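From this we can infer that when an OOM occurs, the runtime saves the relevant information on the heap, and extensions such as SOS simply read it back from the heap and display it. After reading this class I felt I could write a similar extension myself. Anyway, back to the topic. Clicking h.OomInfo jumps to the implementation of OomInfo, which turns out to be a property of the ClrSubHeap class.

Note: ClrSubHeap is not implemented in the diagnostics repository but in Microsoft.Diagnostics.Runtime, which can be downloaded with git clone https://github.com/microsoft/clrmd.git.

Open Microsoft.Diagnostics.Runtime.sln and search for OomInfo to find its implementation: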

public ClrOutOfMemoryInfo? OomInfo => Heap.Helpers.GetOOMInfo(Address, out OomInfo oomInfo) ? new(oomInfo) : null;

OomInfo comes from Heap.Helpers.GetOOMInfo(). Heap is a property of ClrSubHeap that is passed in when the ClrSubHeap is constructed.

public class ClrSubHeap : IClrSubHeap
{
    internal ClrSubHeap(ClrHeap clrHeap, in SubHeapInfo subHeap)
    {
        Heap = clrHeap;
        // ...
    }

    //...

    public ClrHeap Heap { get; }
    IClrHeap IClrSubHeap.Heap => Heap;

    //...
}

Next, jump to the implementation of ClrHeap to see where Helpers comes from:

public sealed class ClrHeap : IClrHeap
{
    //...

    internal ClrHeap(ClrRuntime runtime, IMemoryReader memoryReader, IAbstractHeapProvider helpers, IAbstractTypeProvider typeHelpers, in GCState gcInfo)
    {
        Runtime = runtime;
        _memoryReader = memoryReader;
        Helpers = helpers; //<----
        
        //...

        SubHeaps = Helpers.EnumerateSubHeaps().Select(r => new ClrSubHeap(this, r)).ToImmutableArray();
        Segments = SubHeaps.SelectMany(r => r.Segments).OrderBy(r => r.FirstObjectAddress).ToImmutableArray();
    }

    internal IAbstractHeap Helpers { get; } //<---

    public ClrRuntime Runtime { get; }
}

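Helpers is passed in as a constructor parameter of ClrHeap, and SubHeaps is also created in the ClrHeap constructor.

Now let's see how ClrHeap itself is constructed. Clicking the reference count above the ClrHeap constructor jumps to the call site, which is the Heap property of ClrRuntime.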

public sealed class ClrRuntime : IClrRuntime
{
    //...

    public ClrHeap Heap
    {
        get
        {
            ClrHeap? heap = _heap;
            while (heap is null) // Flush can cause a race.
            {
                IAbstractHeap? heapHelpers = GetService<IAbstractHeap>(); //<---
                IAbstractTypeHelpers? typeHelpers = GetService<IAbstractTypeHelpers>();

                // These are defined as non-nullable but just in case, double check we have a non-null instance.
                if (heapHelpers is null || typeHelpers is null)
                    throw new NotSupportedException("Unable to create a ClrHeap for this runtime.");

                heap = new(this, DataTarget.DataReader, heapHelpers, typeHelpers);
                Interlocked.CompareExchange(ref _heap, heap, null);
                heap = _heap;
             }

            return heap;
        }
    }
}

The heapHelpers argument comes from IAbstractHeap? heapHelpers = GetService<IAbstractHeap>();

Jumping to the implementation of GetService() gives:

internal T? GetService<T>() where T: class => (T?)_services.GetService(typeof(T));

So GetService<T>() is implemented by calling _services.GetService(typeof(T)). Next, where does _services come from?

public sealed class ClrRuntime : IClrRuntime
{
    private readonly IServiceProvider _services;
    private volatile ClrHeap? _heap;
    private ImmutableArray<ClrThread> _threads;
    private volatile DomainAndModules? _domainAndModules;

    private IAbstractRuntime? _runtime;
    private IAbstractComHelpers? _comHelpers;
    private IAbstractMethodLocator? _methodLocator;
    private IAbstractDacController? _controller;

    internal ClrRuntime(ClrInfo clrInfo, IServiceProvider services)
    {
        ClrInfo = clrInfo;
        DataTarget = clrInfo.DataTarget;
        _services = services; //<---
    }
    //...
}

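_services is passed in when the ClrRuntime is constructed. Clicking the reference count above the ClrRuntime constructor shows that ClrRuntime is created in ClrInfo's CreateRuntimeWorker() function, and that the services argument comes from ClrInfoProvider.GetDacServices().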

public sealed class ClrInfo : IClrInfo
{
    private ClrRuntime CreateRuntimeWorker(string? dacPath, bool ignoreMismatch, bool verifySignature)
    {
        IServiceProvider services = ClrInfoProvider.GetDacServices(this, dacPath, ignoreMismatch, verifySignature);
        return new ClrRuntime(this, services);
    }
}

As for ClrInfoProvider, it is a property of ClrInfo that is initialized when the ClrInfo is constructed.

public sealed class ClrInfo : IClrInfo
{
    internal ClrInfo(DataTarget dt, ModuleInfo module, Version clrVersion, IClrInfoProvider provider)
    {
        DataTarget = dt ?? throw new ArgumentNullException(nameof(dt));
        ModuleInfo = module ?? throw new ArgumentNullException(nameof(module));
        ClrInfoProvider = provider ?? throw new ArgumentNullException(nameof(provider)); //<---
        Version = clrVersion ?? throw new ArgumentNullException(nameof(clrVersion));
    }

    /// <summary>
    /// The DataTarget containing this ClrInfo.
    /// </summary>
    public DataTarget DataTarget { get; }

    /// <summary>
    /// The IClrInfoProvider which created this ClrInfo.
    /// </summary>
    internal IClrInfoProvider ClrInfoProvider { get; } //<---

    IDataTarget IClrInfo.DataTarget => DataTarget;
    
    //...
}

Next, where is ClrInfo created? Clicking the reference count above the ClrInfo constructor shows that it comes from the CreateClrInfo() function of DotNetClrInfoProvider.

internal class DotNetClrInfoProvider : IClrInfoProvider
{
    //...

    protected ClrInfo CreateClrInfo(DataTarget dataTarget, ModuleInfo module, ulong runtimeInfo, ClrFlavor flavor)
    {
        //...

        ClrInfo result = new(dataTarget, module, version, this) //<---
        {
            Flavor = flavor,
            DebuggingLibraries = orderedDebugLibraries.ToImmutableArray(),
            ContractDescriptorAddress = contractDescriptor,
            IndexFileSize = indexFileSize,
            IndexTimeStamp = indexTimeStamp,
            BuildId = buildId,
        };

        return result;
    }
}

As the code shows, the last argument to the ClrInfo constructor is this, so the ClrInfoProvider inside ClrInfo is an object of type DotNetClrInfoProvider. Now look at DotNetClrInfoProvider::GetDacServices():

public IServiceProvider GetDacServices(ClrInfo clrInfo, string? providedPath, bool ignoreMismatch, bool verifySignature)
{
    DacLibrary library = GetDacLibraryFromPath(clrInfo, providedPath, ignoreMismatch, verifySignature);
    return new DacServiceProvider(clrInfo, library);
}

It returns a DacServiceProvider, so ClrRuntime._services is actually a DacServiceProvider object. The call IAbstractHeap? heapHelpers = GetService<IAbstractHeap>() in ClrRuntime's Heap property is therefore equivalent to calling DacServiceProvider.GetService(typeof(IAbstractHeap)).

The implementation of DacServiceProvider.GetService(Type) is as follows:

internal class DacServiceProvider : IServiceProvider, IDisposable, IAbstractDacController
{
    //...

    public object? GetService(Type serviceType)
    {
        if (serviceType == typeof(IAbstractRuntime))
            return _runtime ??= new DacRuntime(_clrInfo, _process, _sos, _sos13);

        if (serviceType == typeof(IAbstractHeap)) //<---
        {
            IAbstractHeap? heap = _heapHelper;
            if (heap is not null)
                return heap;

            if (_sos.GetGCHeapData(out GCInfo data) && _sos.GetCommonMethodTables(out CommonMethodTables mts) && mts.ObjectMethodTable != 0)
                return _heapHelper = new DacHeap(_sos, _sos8, _sos12, _sos16, _dataReader, data, mts);

            return null;
        }

        // ...
    }
    
    // ...
}

So the Helpers member of ClrHeap is a DacHeap. Here is its GetOOMInfo() implementation:

internal sealed class DacHeap : IAbstractHeap
{
    public bool GetOOMInfo(ulong subHeapAddress, out OomInfo oomInfo)
    {
        DacOOMData oomData;
        if (subHeapAddress != 0)
        {
            if (!_sos.GetOOMData(subHeapAddress, out oomData) || oomData.Reason == OutOfMemoryReason.None && oomData.GetMemoryFailure == GetMemoryFailureReason.None)
            {
                oomInfo = default;
                return false;
            }
        }
        else
        {
            if (!_sos.GetOOMData(out oomData) || oomData.Reason == OutOfMemoryReason.None && oomData.GetMemoryFailure == GetMemoryFailureReason.None)
            {
                oomInfo = default;
                return false;
            }
        }

        oomInfo = new()
        {
            AllocSize = oomData.AllocSize,
            AvailablePageFileMB = oomData.AvailablePageFileMB,
            GCIndex = oomData.GCIndex,
            GetMemoryFailure = oomData.GetMemoryFailure,
            IsLOH = oomData.IsLOH != 0,
            Reason = oomData.Reason,
            Size = oomData.Size,
        };
        return true;
    }
}

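It calls _sos.GetOOMData(): the overload that takes the sub-heap address when that address is non-zero, and the overload without an address otherwise. _sos is a member of DacHeap, taken from the first parameter of the DacHeap constructor.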

internal sealed class DacHeap : IAbstractHeap
{
    private readonly SOSDac _sos; //<--
    private readonly SOSDac8? _sos8;
    
    //...

    public DacHeap(SOSDac sos, SOSDac8? sos8, SosDac12? sos12, ISOSDac16? sos16, IMemoryReader reader, in GCInfo gcInfo, in CommonMethodTables commonMethodTables)
    {
        _sos = sos; //<--
        _sos8 = sos8;
        
        //...
    }
}

DacHeap, in turn, is created in DacServiceProvider.GetService(Type). The key line is:

return _heapHelper = new DacHeap(_sos, _sos8, _sos12, _sos16, _dataReader, data, mts);

The first argument passed to DacHeap is the _sos member of DacServiceProvider, which is initialized in the DacServiceProvider constructor:

internal class DacServiceProvider : IServiceProvider, IDisposable, IAbstractDacController
{
    private readonly ClrInfo _clrInfo;
    private readonly IDataReader _dataReader;

    private readonly DacLibrary _dac;
    private readonly ClrDataProcess _process;
    private readonly SOSDac _sos;
    private readonly SOSDac6? _sos6;

    //...

    public DacServiceProvider(ClrInfo clrInfo, DacLibrary library)
    {
        _clrInfo = clrInfo;
        _dataReader = _clrInfo.DataTarget.DataReader;

        _dac = library;
        _process = library.CreateClrDataProcess();
        _sos = _process.CreateSOSDacInterface() ?? throw new InvalidOperationException($"Could not create ISOSDacInterface."); //<--
        _sos6 = _process.CreateSOSDacInterface6();

        library.DacDataTarget.SetMagicCallback(_process.Flush);
        IsThreadSafe = _sos13 is not null || RuntimeInformation.IsOSPlatform(OSPlatform.Windows);
    }

    // ...
}

_sos is created by _process.CreateSOSDacInterface(), and _process is of type ClrDataProcess. Here is the implementation of _process.CreateSOSDacInterface():

internal sealed unsafe class ClrDataProcess : CallableCOMWrapper
{
    private static readonly Guid IID_IXCLRDataProcess = new("5c552ab6-fc09-4cb3-8e36-22fa03c798b7");
    private readonly DacLibrary _library;

    //...

    public SOSDac? CreateSOSDacInterface()
    {
        IntPtr result = QueryInterface(SOSDac.IID_ISOSDac);
        if (result == IntPtr.Zero)
            return null;

        try
        {
            return new SOSDac(_library, result);
        }
        catch (InvalidOperationException)
        {
            return null;
        }
    }

    //...
}

This function returns a SOSDac object. The second constructor argument is obtained via QueryInterface(SOSDac.IID_ISOSDac), where SOSDac.IID_ISOSDac is 436f00f2-b42a-4b9f-870c-e73db66ae930, a static field of the SOSDac class. SOSDac is defined as follows:

internal sealed unsafe class SOSDac : CallableCOMWrapper
{
    internal static readonly Guid IID_ISOSDac = new("436f00f2-b42a-4b9f-870c-e73db66ae930");

    private readonly DacLibrary _library;
    private volatile Dictionary<int, string>? _regNames;
    private volatile Dictionary<ulong, string>? _frameNames;

    public SOSDac(DacLibrary? library, IntPtr ptr)
        : base(library?.OwningLibrary, IID_ISOSDac, ptr)
    {
        _library = library ?? throw new ArgumentNullException(nameof(library));
    }

    private ref readonly ISOSDacVTable VTable => ref Unsafe.AsRef<ISOSDacVTable>(_vtable);

    public SOSDac(DacLibrary lib, CallableCOMWrapper toClone) : base(toClone)
    {
        _library = lib;
    }

    //...

    public HResult GetOOMData(out DacOOMData oomData) => VTable.GetOOMStaticData(Self, out oomData);

    public HResult GetOOMData(ulong address, out DacOOMData oomData) => VTable.GetOOMData(Self, address, out oomData);
}

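This class does no real work of its own; every call is forwarded to the implementation behind VTable. Its base class is CallableCOMWrapper, so it is a fair guess that this is a COM call wrapper and that the real implementation lives in the native layer. Is that right? A search in the native code will tell.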
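To make the forwarding concrete, here is a minimal C++ sketch of what a call through a COM-style vtable amounts to. Every name in it (OomData, FakeObject, FakeVTable, native_get_oom, wrapper_get_oom) is hypothetical and heavily simplified. The real ISOSDacInterface vtable has many more slots and uses HRESULT and STDMETHODCALLTYPE, none of which is modelled here.

```cpp
#include <cstdint>

// Hypothetical, trimmed-down payload standing in for DacOOMData.
struct OomData {
    int reason;
    std::uint64_t size;
};

struct FakeObject;

// Hypothetical vtable with a single slot; a real COM interface has many.
struct FakeVTable {
    int (*GetOOMStaticData)(FakeObject* self, OomData* data);
};

// A COM-style object starts with a pointer to its vtable.
struct FakeObject {
    const FakeVTable* vtable;
    OomData stored;  // stands in for state owned by the native side
};

// Plays the role of the native implementation behind the interface.
inline int native_get_oom(FakeObject* self, OomData* data) {
    *data = self->stored;
    return 0;  // 0 models a success code
}

// Plays the role of the managed wrapper: it does nothing except
// look up the slot in the vtable and forward the call.
inline int wrapper_get_oom(FakeObject* self, OomData* data) {
    return self->vtable->GetOOMStaticData(self, data);
}
```

The wrapper holds no logic of its own, which is the same impression the SOSDac class gives: the interesting code sits wherever the vtable slot points.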

Looking at the CLR Runtime Implementation

Searching the native code for 436f00f2-b42a-4b9f-870c-e73db66ae930 finds it in the sospriv.h header.

MIDL_INTERFACE("436f00f2-b42a-4b9f-870c-e73db66ae930") //<---
ISOSDacInterface : public IUnknown
{
public:
    //...
    
    virtual HRESULT STDMETHODCALLTYPE GetOOMData( 
        CLRDATA_ADDRESS oomAddr,
        struct DacpOomData *data) = 0;
    
    virtual HRESULT STDMETHODCALLTYPE GetOOMStaticData( 
        struct DacpOomData *data) = 0;
    
    //...
};

Searching further for ISOSDacInterface finds where it is used in daccess.cpp; the implementing class is ClrDataAccess.

Note: The corresponding declaration is in D:\dotnet\runtime\src\coreclr\debug\daccess\dacimpl.h

//D:\dotnet\runtime\src\coreclr\debug\daccess\daccess.cpp

STDMETHODIMP
ClrDataAccess::QueryInterface(THIS_
                              IN REFIID interfaceId,
                              OUT PVOID* iface)
{
    void* ifaceRet;

    if (IsEqualIID(interfaceId, IID_IUnknown) ||
        IsEqualIID(interfaceId, __uuidof(IXCLRDataProcess)) ||
        IsEqualIID(interfaceId, __uuidof(IXCLRDataProcess2)))
    {
        ifaceRet = static_cast<IXCLRDataProcess2*>(this);
    }
    else if (IsEqualIID(interfaceId, __uuidof(ICLRDataEnumMemoryRegions)))
    {
        ifaceRet = static_cast<ICLRDataEnumMemoryRegions*>(this);
    }
    else if (IsEqualIID(interfaceId, __uuidof(ISOSDacInterface))) //<---
    {
        ifaceRet = static_cast<ISOSDacInterface*>(this);
    }
    else if (IsEqualIID(interfaceId, __uuidof(ISOSDacInterface2)))
    {
        ifaceRet = static_cast<ISOSDacInterface2*>(this);
    }
    //...
    
    AddRef();
    *iface = ifaceRet;
    return S_OK;
}

The implementation of ClrDataAccess::GetOOMData() is as follows:

//D:\dotnet\runtime\src\coreclr\debug\daccess\request.cpp

HRESULT
ClrDataAccess::GetOOMData(CLRDATA_ADDRESS oomAddr, struct DacpOomData *data)
{
    if (oomAddr == 0 || data == NULL)
        return E_INVALIDARG;

    SOSDacEnter();
    *data = {};

    if (!GCHeapUtilities::IsServerHeap())
        hr = E_FAIL; // doesn't make sense to call this on WKS mode

#ifdef FEATURE_SVR_GC
    else
        hr = ServerOomData(oomAddr, data);
#else
    _ASSERTE_MSG(false, "IsServerHeap returned true but FEATURE_SVR_GC not defined");
    hr = E_NOTIMPL;
#endif //FEATURE_SVR_GC

    SOSDacLeave();
    return hr;
}

ClrDataAccess::ServerOomData() is implemented as follows:

//D:\dotnet\runtime\src\coreclr\debug\daccess\request_svr.cpp
    
HRESULT
ClrDataAccess::ServerOomData(CLRDATA_ADDRESS addr, DacpOomData *oomData)
{
    TADDR heapAddress = TO_TADDR(addr);
    dac_gc_heap heap = LoadGcHeapData(heapAddress);
    dac_gc_heap* pHeap = &heap;

    oom_history pOOMInfo = pHeap->oom_info;
    oomData->reason = pOOMInfo.reason;
    oomData->alloc_size = pOOMInfo.alloc_size;
    oomData->available_pagefile_mb = pOOMInfo.available_pagefile_mb;
    oomData->gc_index = pOOMInfo.gc_index;
    oomData->fgm = pOOMInfo.fgm;
    oomData->size = pOOMInfo.size;
    oomData->loh_p = pOOMInfo.loh_p;

    return S_OK;
}

As the code shows, oomData is filled from pHeap->oom_info. Here is the declaration of oom_info:

//D:\dotnet\runtime\src\coreclr\gc\gcpriv.h

class gc_heap
{
    //...
    PER_HEAP_FIELD_DIAG_ONLY oom_history oom_info;
    //...
}

Its type is oom_history, defined as follows:

//D:\dotnet\runtime\src\coreclr\gc\gcinterface.dac.h

// Reasons why an OOM might occur, recorded in the oom_history
// struct below.
enum oom_reason
{
    oom_no_failure = 0,
    oom_budget = 1,
    oom_cant_commit = 2,
    oom_cant_reserve = 3,
    oom_loh = 4,
    oom_low_mem = 5,
    oom_unproductive_full_gc = 6
};

/*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
/* If you modify failure_get_memory and         */
/* oom_reason be sure to make the corresponding */
/* changes in ClrMD.                            */
/*!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
enum failure_get_memory
{
    fgm_no_failure = 0,
    fgm_reserve_segment = 1,
    fgm_commit_segment_beg = 2,
    fgm_commit_eph_segment = 3,
    fgm_grow_table = 4,
    fgm_commit_table = 5
};

// A record of the last OOM that occurred in the GC, with some
// additional information as to what triggered the OOM.
struct oom_history
{
    oom_reason reason;
    size_t alloc_size;
    uint8_t* reserved;
    uint8_t* allocated;
    size_t gc_index;
    failure_get_memory fgm;
    size_t size;
    size_t available_pagefile_mb;
    BOOL loh_p;
};
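As a rough illustration of how these native values end up as the text in the screenshot, the sketch below maps failure_get_memory to the messages that AnalyzeOOMCommand prints. The enum values are copied from the definition above; describe_fgm and format_details are hypothetical helpers written for this article, not part of the runtime or SOS, and the size is printed without the thousands separators that the real command uses.

```cpp
#include <cstddef>
#include <string>

// Values copied from the enum quoted above.
enum failure_get_memory {
    fgm_no_failure = 0,
    fgm_reserve_segment = 1,
    fgm_commit_segment_beg = 2,
    fgm_commit_eph_segment = 3,
    fgm_grow_table = 4,
    fgm_commit_table = 5
};

// Hypothetical helper: message text taken from AnalyzeOOMCommand.
inline const char* describe_fgm(failure_get_memory fgm) {
    switch (fgm) {
        case fgm_reserve_segment:
            return "Failed to reserve memory";
        case fgm_commit_segment_beg:
            return "Didn't have enough memory to commit beginning of the segment";
        case fgm_commit_eph_segment:
            return "Didn't have enough memory to commit the new ephemeral segment";
        case fgm_grow_table:
            return "Didn't have enough memory to grow the internal GC data structures";
        case fgm_commit_table:
            return "Didn't have enough memory to commit the internal GC data structures";
        default:
            return "No failure";
    }
}

// Hypothetical helper: builds a "Details:" line similar to the one in the command.
inline std::string format_details(failure_get_memory fgm, std::size_t size, bool loh_p) {
    return std::string("Details: ") + (loh_p ? "LOH " : "SOH ") +
           describe_fgm(fgm) + " " + std::to_string(size) + " bytes";
}
```

For example, format_details(fgm_reserve_segment, 16777216, false) yields a line of the same shape as the one in the first screenshot.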

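These definitions look very familiar. From what we know so far, this should be the struct that the runtime fills in when an OOM occurs. Searching the code for fgm_reserve_segment finds only two hits: its definition and one use. The use is: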

heap_segment*
gc_heap::get_segment (size_t size, gc_oh_num oh)
{
    assert(oh != gc_oh_num::unknown);
    BOOL uoh_p = (oh == gc_oh_num::loh) || (oh == gc_oh_num::poh);
    if (heap_hard_limit)
        return NULL;

    heap_segment* result = 0;

    if (segment_standby_list != 0)
    {
        result = segment_standby_list;
        heap_segment* last = 0;
        while (result)
        {
            size_t hs = (size_t)(heap_segment_reserved (result) - (uint8_t*)result);
            if ((hs >= size) && ((hs / 2) < size))
            {
                dprintf (2, ("Hoarded segment %zx found", (size_t) result));
                if (last)
                {
                    heap_segment_next (last) = heap_segment_next (result);
                }
                else
                {
                    segment_standby_list = heap_segment_next (result);
                }
                break;
            }
            else
            {
                last = result;
                result = heap_segment_next (result);
            }
        }
    }

    if (!result)
    {
        void* mem = virtual_alloc (size);
        if (!mem)
        {
            fgm_result.set_fgm (fgm_reserve_segment, size, uoh_p); //<---
            return 0;
        }
        
        //...
    }
    
    //...
}

So when virtual_alloc (size) returns null, fgm_reserve_segment is recorded. Now for the implementation of virtual_alloc:


void* virtual_alloc (size_t size)
{
    return virtual_alloc(size, false);
}

void* virtual_alloc (size_t size, bool use_large_pages_p, uint16_t numa_node)
{
    size_t requested_size = size;

    if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
    {
        gc_heap::reserved_memory_limit =
            GCScan::AskForMoreReservedMemory (gc_heap::reserved_memory_limit, requested_size);
        if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
        {
            return 0; //<---
        }
    }

    uint32_t flags = VirtualReserveFlags::None;
#ifndef FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP
    if (virtual_alloc_hardware_write_watch)
    {
        flags = VirtualReserveFlags::WriteWatch;
    }
#endif // !FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP

    //<--- use_large_pages_p is false here, so GCToOSInterface::VirtualReserve is called
    void* prgmem = use_large_pages_p ?
        GCToOSInterface::VirtualReserveAndCommitLargePages(requested_size, numa_node) :
        GCToOSInterface::VirtualReserve(requested_size, card_size * card_word_width, flags, numa_node);
    void *aligned_mem = prgmem;

    // We don't want (prgmem + size) to be right at the end of the address space
    // because we'd have to worry about that everytime we do (address + size).
    // We also want to make sure that we leave loh_size_threshold at the end
    // so we allocate a small object we don't need to worry about overflow there
    // when we do alloc_ptr+size.
    if (prgmem)
    {
        uint8_t* end_mem = (uint8_t*)prgmem + requested_size;

        if ((end_mem == 0) || ((size_t)(MAX_PTR - end_mem) <= END_SPACE_AFTER_GC))
        {
            GCToOSInterface::VirtualRelease (prgmem, requested_size);
            dprintf (2, ("Virtual Alloc size %zd returned memory right against 4GB [%zx, %zx[ - discarding",
                        requested_size, (size_t)prgmem, (size_t)((uint8_t*)prgmem+requested_size)));
            prgmem = 0;
            aligned_mem = 0;
        }
    }

    if (prgmem)
    {
        gc_heap::reserved_memory += requested_size;
    }

    dprintf (2, ("Virtual Alloc size %zd: [%zx, %zx[",
                 requested_size, (size_t)prgmem, (size_t)((uint8_t*)prgmem+requested_size)));

    return aligned_mem;
}

In the code above there are three places that can lead to a null return. The first is:

if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
{
    gc_heap::reserved_memory_limit =
        GCScan::AskForMoreReservedMemory (gc_heap::reserved_memory_limit, requested_size);
    if ((gc_heap::reserved_memory_limit - gc_heap::reserved_memory) < requested_size)
    {
        return 0; //<---
    }
}

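Roughly: if the reserved-memory limit (gc_heap::reserved_memory_limit) minus the memory already reserved (gc_heap::reserved_memory) is less than the number of bytes requested (requested_size), GCScan::AskForMoreReservedMemory() is called to ask for a larger reservation budget, and it returns the new limit. If the new limit minus the reserved memory is still less than the requested size, the function returns null.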
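The budget check described above can be modelled in a few lines of C++. This is only a sketch under stated assumptions: ReserveBudget, AskForMore, passes_budget_check, refuse_more and grant_more are hypothetical names, the callback merely stands in for GCScan::AskForMoreReservedMemory(), and the model assumes limit >= reserved so that the unsigned subtraction does not wrap.

```cpp
#include <cstddef>

// Hypothetical stand-in for the two counters kept by the GC.
struct ReserveBudget {
    std::size_t limit;     // models gc_heap::reserved_memory_limit
    std::size_t reserved;  // models gc_heap::reserved_memory
};

// Hypothetical callback modelling GCScan::AskForMoreReservedMemory:
// it receives the current limit and the request, and returns the new limit.
using AskForMore = std::size_t (*)(std::size_t current_limit, std::size_t requested);

// Returns true when the request fits the budget, possibly after the
// limit has been raised once. Returns false where virtual_alloc returns 0.
inline bool passes_budget_check(ReserveBudget& b, std::size_t requested, AskForMore ask) {
    if (b.limit - b.reserved < requested) {
        b.limit = ask(b.limit, requested);
        if (b.limit - b.reserved < requested) {
            return false;
        }
    }
    return true;
}

// Example policy that never raises the limit.
inline std::size_t refuse_more(std::size_t current_limit, std::size_t) {
    return current_limit;
}

// Example policy that raises the limit by exactly the requested amount.
inline std::size_t grant_more(std::size_t current_limit, std::size_t requested) {
    return current_limit + requested;
}
```

With a limit of 100 and 90 already reserved, a request for 16 fails under refuse_more and succeeds under grant_more. Note that this path fails without ever calling the operating system, so it is independent of how much free address space the process has.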

The second is:

void* prgmem = use_large_pages_p ?
        GCToOSInterface::VirtualReserveAndCommitLargePages(requested_size, numa_node) :
        GCToOSInterface::VirtualReserve(requested_size, card_size * card_word_width, flags, numa_node);
    void *aligned_mem = prgmem;

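Since use_large_pages_p is false, GCToOSInterface::VirtualReserve() is called, which on Windows in turn calls VirtualAlloc() underneath. If that call fails, prgmem is null, and so is the value that virtual_alloc returns.

The third is: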

if (prgmem)
{
    uint8_t* end_mem = (uint8_t*)prgmem + requested_size;

    if ((end_mem == 0) || ((size_t)(MAX_PTR - end_mem) <= END_SPACE_AFTER_GC))
    {
        GCToOSInterface::VirtualRelease (prgmem, requested_size);
        dprintf (2, ("Virtual Alloc size %zd returned memory right against 4GB [%zx, %zx[ - discarding",
                     requested_size, (size_t)prgmem, (size_t)((uint8_t*)prgmem+requested_size)));
        prgmem = 0;
        aligned_mem = 0;
    }
}

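MAX_PTR is the largest possible pointer value, and end_mem is the end address of the range that was just reserved. If end_mem wraps around to 0, or if the space left between end_mem and MAX_PTR is no more than END_SPACE_AFTER_GC, the reservation is released and null is returned. According to the comment in the code, the point is to leave loh_size_threshold bytes of headroom at the end of the address space, so that computing alloc_ptr + size when allocating a small object never has to worry about overflow.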
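The third check is plain pointer arithmetic, so it can be sketched with integers. In the snippet below, too_close_to_end is a hypothetical function, UINTPTR_MAX stands in for MAX_PTR, and the headroom is passed in as a parameter because the actual value of END_SPACE_AFTER_GC is defined elsewhere in the GC sources and is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when a reserved range [start, start + size) would be
// discarded by the end-of-address-space check in virtual_alloc.
// end_space models END_SPACE_AFTER_GC (value supplied by the caller).
inline bool too_close_to_end(std::uintptr_t start, std::size_t size, std::uintptr_t end_space) {
    const std::uintptr_t max_ptr = UINTPTR_MAX;   // models MAX_PTR
    const std::uintptr_t end_mem = start + size;  // wraps to 0 if the range ends at the very top
    return end_mem == 0 || (max_ptr - end_mem) <= end_space;
}
```

A range in the middle of the address space, such as 16 MB starting at 0x10000000, passes this check for any realistic headroom. Only a range that ends within end_space bytes of the top of the address space is rejected, so for an ordinary reservation this path seems an unlikely cause of the failure.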

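That is where this round of digging ends. The error in the first screenshot is essentially caused by virtual_alloc failing. Why it failed, I still have not figured out: it only tries to reserve address space without committing it, so in principle it should not fail when a large enough free block exists. Under what circumstances does VirtualAlloc() fail here? I would be grateful for any pointers from the experts.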

Summary

Once again, I highly recommend FileLocator as a file-content search tool. It is well worth having.

References

.NET source code