(tldr: the question itself is at the bottom)
I've read that on AMD family 17h processors (Zen through Zen 2; this may also hold for later generations, but I am not familiar with them) the L2 cache is inclusive of the L1 cache, while the L3 cache is not inclusive.
Sadly, I was not able to glean any more details from AMD's manuals. Vol. 2 (Rev. 3.41) is the only one that discusses the topic, and it just says: "The L2 cache can be exclusive, meaning it does not cache information contained in the L1 cache. Conversely, inclusive L2 caches contain a copy of the L1-cached information."
This is confirmed, however, by the readout of CPUID on my Zen machine:
$ cpuid # the following reports the Core::X86::Cpuid::CachePropEdx0-3 properties
--- cache 2 ---
[...]
cache inclusive of lower levels = true
--- cache 3 ---
[...]
cache inclusive of lower levels = false
Intel's CPUs, on the other hand, offer an inclusive LLC (L3), as is specified many times throughout all the volumes of the Architecture Developer's manual. Interestingly, however, Intel does offer some versions of its processors without an inclusive L3 cache, such as Skylake Server; see pp. 77-80 of the Optimization Reference Manual, Document #248966-049US.
This is confirmed by the readout of CPUID on my Kaby Lake machine:
$ cpuid # the following reports the Deterministic Cache Parameters leaf
--- cache 2 ---
[...]
inclusive to lower caches = false
--- cache 3 ---
[...]
inclusive to lower caches = true
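For reference, the flag printed in both readouts above is, as far as I can tell, bit 1 of EDX in the relevant leaf (leaf 4 on Intel, leaf 0x8000001D on AMD). Here is a minimal sketch of the decoding, assuming a raw EDX value is already at hand. The function name and the sample values are made up for illustration and were not read from hardware:

```python
# Bit 1 of EDX in CPUID leaf 4 (Intel) and leaf 0x8000001D (AMD) reports
# whether the described cache is inclusive of lower cache levels.
CACHE_INCLUSIVE_BIT = 1 << 1


def is_inclusive(edx: int) -> bool:
    """Return True if the raw EDX value has the inclusiveness bit set."""
    return bool(edx & CACHE_INCLUSIVE_BIT)


# Hypothetical raw EDX values, for illustration only (not real readouts).
for label, edx in [("cache 2", 0b010), ("cache 3", 0b000)]:
    print(label, "inclusive of lower levels =", is_inclusive(edx))
```

The other EDX bits are ignored here; only the inclusiveness bit matters for this question.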
See also, from these 2010 slides: "AMD processors tend to have exclusive caches; Intel processors tend to have inclusive caches"
As I understand it (and I apologize if I am confused here), in Intel's implementation the L3 serves as a global directory (or replacement) for the snooping algorithm: because the L3 is inclusive of every line held in the core caches, a request for a line can determine from the additional information maintained by the L3 whether that line is present in any other core.
As Intel's Optimization manual puts it:
B.4.5.3 Global Queue Snoop Events
Cacheline requests from the cores or from a remote package or the I/O Hub are handled by the GQ. When the uncore receives a cacheline request from one of the cores, the GQ first checks the L3 to see if the line is on the package. Because the L3 is inclusive, this answer can be quickly ascertained. If the line is in the L3 and was owned by the requesting core, data can be returned to the core from the L3 directly. If the line is being used by multiple cores, the GQ will snoop the other cores to see if there is a modified copy. If so the L3 is updated and the line is sent to the requesting core.
See also, from Optimization manual Vol.2 Document #: 356477-002US:
The LLC is inclusive of all cache levels above it - data contained in the core caches must also reside in the LLC. Each cache line in the LLC holds an indication of the cores that may have this line in their L2 and L1 caches. If there is an indication in the LLC that other cores may hold the line of interest and its state might have to modify, there is a lookup into the L1 DCache and L2 of these cores too. The lookup is called “clean” if it does not require fetching data from the other core caches. The lookup is called “dirty” if modified data has to be fetched from the other core caches and transferred to the loading core.
In the section on Nehalem in ibid.: "New cache hierarchy organization with shared, inclusive L3 to reduce snoop traffic." and "The shared L3 cache is writeback and inclusive, such that a cache line that exists in either L1 data cache, L1 instruction cache, unified L2 cache also exists in L3. The L3 is designed to use the inclusive nature to minimize snoop traffic between processor cores" and so on and so forth.
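My reading of the passages quoted above can be sketched as a toy model. This is only an illustration of how I understand the mechanism, not Intel's actual implementation; the class and method names are hypothetical, and cache states, evictions and capacity are left out entirely:

```python
from dataclasses import dataclass, field


@dataclass
class InclusiveLLC:
    """Toy inclusive L3: tracks which cores may hold each cached line."""

    # line address -> set of core ids that may hold the line in L1/L2
    core_valid: dict = field(default_factory=dict)

    def record_fill(self, addr: int, core: int) -> None:
        """Note that `core` brought the line at `addr` into its caches."""
        self.core_valid.setdefault(addr, set()).add(core)

    def cores_to_snoop(self, addr: int, requester: int) -> set:
        """Return the cores that would have to be looked up for `addr`."""
        if addr not in self.core_valid:
            # Inclusive: a miss in the L3 implies no core cache has the line,
            # so no core needs to be snooped at all.
            return set()
        # Hit: only cores flagged as possible holders need a lookup.
        return self.core_valid[addr] - {requester}


llc = InclusiveLLC()
llc.record_fill(0x40, core=0)
llc.record_fill(0x40, core=2)
print(llc.cores_to_snoop(0x40, requester=0))  # only core 2 is flagged
print(llc.cores_to_snoop(0x80, requester=1))  # L3 miss, nobody to snoop
```

The point of the sketch is the early return: with an inclusive L3, a miss alone answers the question of whether any core holds the line, which is the property I assume a non-inclusive L3 gives up.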
What is the reasoning behind AMD's decision not to make the L3 inclusive? Does this mean that "snooping traffic" between cores is present to a much larger extent in their processors than in Intel's? I know the newer generations of their processors certainly do not lack LLC capacity, with the 3D V-Cache and all. Is it just a matter of preference in how Intel and AMD organize their cache coherence algorithms? Or is it Intel that's the outlier here?