Two weeks ago we migrated to SQL Server 2016 SP1 CU3 (previously our app was running on SQL Server 2014 SP2). The application hasn't been upgraded or changed, nor has the workload changed.
Since the migration, every few minutes to roughly one hour, the proc cache is being flushed (not entirely, but the majority of plans still go away). If I run:
```sql
SELECT COUNT(*) FROM sys.dm_exec_cached_plans;
```
...just after the clearing happens, the number of plans drops to somewhere between 100 and 300 and then gradually increases to roughly 2,000, then the clearing usually happens again, and so on. It's worth mentioning that the buffer pool seems to stay intact and only the caches are affected.
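To narrow down which cache stores are being trimmed, I also look at the plans broken down by object type and at the sizes of the individual cache stores. This is only a sketch against the standard DMVs, not output from SentryOne:

```sql
-- Plans per object/cache type, to see which part of the cache gets flushed
SELECT objtype,
       cacheobjtype,
       COUNT(*)                         AS plan_count,
       SUM(size_in_bytes) / 1024 / 1024 AS size_mb
FROM sys.dm_exec_cached_plans
GROUP BY objtype, cacheobjtype
ORDER BY plan_count DESC;

-- Size of the individual cache stores (CACHESTORE_SQLCP, CACHESTORE_OBJCP, ...)
SELECT name,
       [type],
       pages_kb / 1024 AS pages_mb,
       entries_count
FROM sys.dm_os_memory_cache_counters
WHERE [type] LIKE 'CACHESTORE%'
ORDER BY pages_kb DESC;
```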
The server runs on VMware and has 128 GB of RAM (SQL Server max server memory is set to 102 GB, min server memory is set to 72 GB).
Based on the output from SentryOne, I can see that the buffer pool consumes ~61 GB.
![SQL Server memory usage]()
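For completeness, the same numbers can be cross-checked straight from the DMVs. This is only a sketch and mirrors what SentryOne shows rather than adding anything new:

```sql
-- Memory as seen by the SQL Server process itself
SELECT physical_memory_in_use_kb / 1024 AS sql_physical_mb,
       memory_utilization_percentage,
       process_physical_memory_low,
       process_virtual_memory_low
FROM sys.dm_os_process_memory;

-- Memory as seen at the OS level
SELECT total_physical_memory_kb / 1024     AS total_os_mb,
       available_physical_memory_kb / 1024 AS available_os_mb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;
```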
My SQL Server version is as follows:
Microsoft SQL Server 2016 (SP1-CU3) (KB4019916) - 13.0.4435.0 (X64)
It's Standard Edition.
Finally, I came across an article by Jonathan Kehayias and decided to check the ring buffers, and boom! It turns out I have notifications from the resource monitor saying 'low physical memory'. The occurrences of this notification line up perfectly with the proc cache clearing. Now the question is how to interpret these results and how to find the responsible process. As you can see in the results of this query:
```sql
SELECT
    EventTime,
    record.value('(/Record/ResourceMonitor/Notification)[1]', 'varchar(max)') AS [Type],
    record.value('(/Record/ResourceMonitor/IndicatorsProcess)[1]', 'int')     AS [IndicatorsProcess],
    record.value('(/Record/ResourceMonitor/IndicatorsSystem)[1]', 'int')      AS [IndicatorsSystem],
    record.value('(/Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint')      AS [AvailPhysMem,Kb],
    record.value('(/Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint') AS [Avail VAS,Kb]
FROM (
    SELECT
        DATEADD(ss, (-1 * ((cpu_ticks / CONVERT(float, (cpu_ticks / ms_ticks))) - [timestamp]) / 1000), GETDATE()) AS EventTime,
        CONVERT(xml, record) AS record
    FROM sys.dm_os_ring_buffers
    CROSS JOIN sys.dm_os_sys_info
    WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR'
) AS tab
ORDER BY EventTime DESC;
```
![Ring buffer query results]()
we can observe the 'low physical memory' flag, although the value of available physical memory stays at the same level. Moreover, if I am not mistaken, the above results indicate internal memory pressure (IndicatorsProcess = 2), which is weird to me, as SentryOne consistently shows that SQL Server doesn't fully utilize the allocated memory. This is the memory usage captured by SentryOne for a sample taken at 8 AM:
![SQL Server memory usage]()
All lines are pretty flat. What is also weird to me is that the whole group of events happens at the same time, so this pressure lasts milliseconds or less (perhaps this is also why SentryOne doesn't capture anything, as it collects data far less frequently).
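Since SentryOne's sampling interval is too coarse to catch this, I'm considering an Extended Events session with a ring_buffer target to capture the notifications as they fire. This is only a sketch and assumes the resource_monitor_ring_buffer_recorded event exists on this build; I would verify it in sys.dm_xe_objects first:

```sql
-- Assumption: verify the event name and package first, e.g.
-- SELECT p.name AS package, o.name
-- FROM sys.dm_xe_objects o
-- JOIN sys.dm_xe_packages p ON o.package_guid = p.guid
-- WHERE o.name LIKE '%resource_monitor%';
CREATE EVENT SESSION [TrackMemoryPressure] ON SERVER
ADD EVENT sqlos.resource_monitor_ring_buffer_recorded
ADD TARGET package0.ring_buffer (SET max_memory = 4096)
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

ALTER EVENT SESSION [TrackMemoryPressure] ON SERVER STATE = START;
```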
I tried to find the reason behind this internal pressure and checked the top 10 memory clerks (by memory consumption) to see if there are any heavy consumers among them:
![Memory clerks query results]()
but to be honest I don't see anything suspicious there.
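For reference, the clerks check was roughly along these lines (a sketch; the exact query behind the screenshot may have differed slightly):

```sql
-- Top 10 memory clerks by allocated pages, looking for an unusually hungry consumer
SELECT TOP (10)
       [type],
       name,
       SUM(pages_kb) / 1024 AS pages_mb
FROM sys.dm_os_memory_clerks
GROUP BY [type], name
ORDER BY pages_mb DESC;
```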
Another thing worth mentioning: I haven't tried LPIM yet, as it requires a SQL Server restart, but even if it is the solution I would really like to understand why this issue happens. Moreover, please correct me if I am wrong, but since the buffer pool does not seem to be affected by the trimming, I don't really think LPIM is the solution here.
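For completeness, whether LPIM is actually in effect can be confirmed without a restart; this assumes the sql_memory_model_desc column, which as far as I know was added in 2016 SP1:

```sql
-- CONVENTIONAL = no LPIM, LOCK_PAGES = LPIM in use, LARGE_PAGES = trace flag 834
SELECT sql_memory_model_desc
FROM sys.dm_os_sys_info;
```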
Now I am completely lost and I don't really know what else I should check to find the root cause of this issue. I would really appreciate it if someone could help me solve this puzzle.