Commit 06cff4a5 authored by Linus Torvalds's avatar Linus Torvalds
Browse files
Pull arm64 updates from Will Deacon:
 "The highlights this time are support for dynamically enabling and
  disabling Clang's Shadow Call Stack at boot and a long-awaited
  optimisation to the way in which we handle the SVE register state on
  system call entry to avoid taking unnecessary traps from userspace.

  Summary:

  ACPI:
   - Enable FPDT support for boot-time profiling
   - Fix CPU PMU probing to work better with PREEMPT_RT
   - Update SMMUv3 MSI DeviceID parsing to latest IORT spec
   - APMT support for probing Arm CoreSight PMU devices

  CPU features:
   - Advertise new SVE instructions (v2.1)
   - Advertise range prefetch instruction
   - Advertise CSSC ("Common Short Sequence Compression") scalar
     instructions, adding things like min, max, abs, popcount
   - Enable DIT (Data Independent Timing) when running in the kernel
   - More conversion of system register fields over to the generated
     header

  CPU misfeatures:
   - Workaround for Cortex-A715 erratum #2645198

  Dynamic SCS:
   - Support for dynamic shadow call stacks to allow switching at
     runtime between Clang's SCS implementation and the CPU's pointer
     authentication feature when it is supported (complete with scary
     DWARF parser!)

  Tracing and debug:
   - Remove static ftrace in favour of, err, dynamic ftrace!
   - Seperate 'struct ftrace_regs' from 'struct pt_regs' in core ftrace
     and existing arch code
   - Introduce and implement FTRACE_WITH_ARGS on arm64 to replace the
     old FTRACE_WITH_REGS
   - Extend 'crashkernel=' parameter with default value and fallback to
     placement above 4G physical if initial (low) allocation fails

  SVE:
   - Optimisation to avoid disabling SVE unconditionally on syscall
     entry and just zeroing the non-shared state on return instead

  Exceptions:
   - Rework of undefined instruction handling to avoid serialisation on
     global lock (this includes emulation of user accesses to the ID
     registers)

  Perf and PMU:
   - Support for TLP filters in Hisilicon's PCIe PMU device
   - Support for the DDR PMU present in Amlogic Meson G12 SoCs
   - Support for the terribly-named "CoreSight PMU" architecture from
     Arm (and Nvidia's implementation of said architecture)

  Misc:
   - Tighten up our boot protocol for systems with memory above 52 bits
     physical
   - Const-ify static keys to satisty jump label asm constraints
   - Trivial FFA driver cleanups in preparation for v1.1 support
   - Export the kernel_neon_* APIs as GPL symbols
   - Harden our instruction generation routines against instrumentation
   - A bunch of robustness improvements to our arch-specific selftests
   - Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (151 commits)
  arm64: kprobes: Return DBG_HOOK_ERROR if kprobes can not handle a BRK
  arm64: kprobes: Let arch do_page_fault() fix up page fault in user handler
  arm64: Prohibit instrumentation on arch_stack_walk()
  arm64:uprobe fix the uprobe SWBP_INSN in big-endian
  arm64: alternatives: add __init/__initconst to some functions/variables
  arm_pmu: Drop redundant armpmu->map_event() in armpmu_event_init()
  kselftest/arm64: Allow epoll_wait() to return more than one result
  kselftest/arm64: Don't drain output while spawning children
  kselftest/arm64: Hold fp-stress children until they're all spawned
  arm64/sysreg: Remove duplicate definitions from asm/sysreg.h
  arm64/sysreg: Convert ID_DFR1_EL1 to automatic generation
  arm64/sysreg: Convert ID_DFR0_EL1 to automatic generation
  arm64/sysreg: Convert ID_AFR0_EL1 to automatic generation
  arm64/sysreg: Convert ID_MMFR5_EL1 to automatic generation
  arm64/sysreg: Convert MVFR2_EL1 to automatic generation
  arm64/sysreg: Convert MVFR1_EL1 to automatic generation
  arm64/sysreg: Convert MVFR0_EL1 to automatic generation
  arm64/sysreg: Convert ID_PFR2_EL1 to automatic generation
  arm64/sysreg: Convert ID_PFR1_EL1 to automatic generation
  arm64/sysreg: Convert ID_PFR0_EL1 to automatic generation
  ...
parents 164f5900 5f4c3747
Loading
Loading
Loading
Loading
+6 −9
Original line number Diff line number Diff line
@@ -831,7 +831,7 @@
			memory region [offset, offset + size] for that kernel
			image. If '@offset' is omitted, then a suitable offset
			is selected automatically.
			[KNL, X86-64] Select a region under 4G first, and
			[KNL, X86-64, ARM64] Select a region under 4G first, and
			fall back to reserve region above 4G when '@offset'
			hasn't been specified.
			See Documentation/admin-guide/kdump/kdump.rst for further details.
@@ -851,26 +851,23 @@
			available.
			It will be ignored if crashkernel=X is specified.
	crashkernel=size[KMG],low
			[KNL, X86-64] range under 4G. When crashkernel=X,high
			[KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
			is passed, kernel could allocate physical memory region
			above 4G, that cause second kernel crash on system
			that require some amount of low memory, e.g. swiotlb
			requires at least 64M+32K low memory, also enough extra
			low memory is needed to make sure DMA buffers for 32-bit
			devices won't run out. Kernel would try to allocate
			at least 256M below 4G automatically.
			default	size of memory below 4G automatically. The default
			size is	platform dependent.
			  --> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
			  --> arm64: 128MiB
			This one lets the user specify own low range under 4G
			for second kernel instead.
			0: to disable low allocation.
			It will be ignored when crashkernel=X,high is not used
			or memory reserved is below 4G.

			[KNL, ARM64] range in low memory.
			This one lets the user specify a low range in the
			DMA zone for the crash dump kernel.
			It will be ignored when crashkernel=X,high is not used
			or memory reserved is located in the DMA zones.

	cryptomgr.notests
			[KNL] Disable crypto self-tests

+68 −44
Original line number Diff line number Diff line
@@ -15,10 +15,10 @@ HiSilicon PCIe PMU driver
The PCIe PMU driver registers a perf PMU with the name of its sicl-id and PCIe
Core id.::

  /sys/bus/event_source/hisi_pcie<sicl>_<core>
  /sys/bus/event_source/hisi_pcie<sicl>_core<core>

PMU driver provides description of available events and filter options in sysfs,
see /sys/bus/event_source/devices/hisi_pcie<sicl>_<core>.
see /sys/bus/event_source/devices/hisi_pcie<sicl>_core<core>.

The "format" directory describes all formats of the config (events) and config1
(filter options) fields of the perf_event_attr structure. The "events" directory
@@ -33,13 +33,13 @@ monitored by PMU.
Example usage of perf::

  $# perf list
  hisi_pcie0_0/rx_mwr_latency/ [kernel PMU event]
  hisi_pcie0_0/rx_mwr_cnt/ [kernel PMU event]
  hisi_pcie0_core0/rx_mwr_latency/ [kernel PMU event]
  hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event]
  ------------------------------------------

  $# perf stat -e hisi_pcie0_0/rx_mwr_latency/
  $# perf stat -e hisi_pcie0_0/rx_mwr_cnt/
  $# perf stat -g -e hisi_pcie0_0/rx_mwr_latency/ -e hisi_pcie0_0/rx_mwr_cnt/
  $# perf stat -e hisi_pcie0_core0/rx_mwr_latency/
  $# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/
  $# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/

The current driver does not support sampling. So "perf record" is unsupported.
Also attach to a task is unsupported for PCIe PMU.
@@ -48,41 +48,46 @@ Filter options
--------------

1. Target filter
PMU could only monitor the performance of traffic downstream target Root Ports
or downstream target Endpoint. PCIe PMU driver support "port" and "bdf"
interfaces for users, and these two interfaces aren't supported at the same
time.

   PMU could only monitor the performance of traffic downstream target Root
   Ports or downstream target Endpoint. PCIe PMU driver support "port" and
   "bdf" interfaces for users, and these two interfaces aren't supported at the
   same time.

   - port

     "port" filter can be used in all PCIe PMU events, target Root Port can be
selected by configuring the 16-bits-bitmap "port". Multi ports can be selected
for AP-layer-events, and only one port can be selected for TL/DL-layer-events.
     selected by configuring the 16-bits-bitmap "port". Multi ports can be
     selected for AP-layer-events, and only one port can be selected for
     TL/DL-layer-events.

For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of bitmap
should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4 lanes),
bit8 is set, port=0x100; if these two Root Ports are both monitored, port=0x101.
     For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of
     bitmap should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4
     lanes), bit8 is set, port=0x100; if these two Root Ports are both
     monitored, port=0x101.

     Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mwr_latency,port=0x1/ sleep 5
       $# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0x1/ sleep 5

   - bdf

"bdf" filter can only be used in bandwidth events, target Endpoint is selected
by configuring BDF to "bdf". Counter only counts the bandwidth of message
requested by target Endpoint.
     "bdf" filter can only be used in bandwidth events, target Endpoint is
     selected by configuring BDF to "bdf". Counter only counts the bandwidth of
     message requested by target Endpoint.

     For example, "bdf=0x3900" means BDF of target Endpoint is 0000:39:00.0.

     Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mrd_flux,bdf=0x3900/ sleep 5
       $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,bdf=0x3900/ sleep 5

2. Trigger filter

   Event statistics start when the first time TLP length is greater/smaller
than trigger condition. You can set the trigger condition by writing "trig_len",
and set the trigger mode by writing "trig_mode". This filter can only be used
in bandwidth events.
   than trigger condition. You can set the trigger condition by writing
   "trig_len", and set the trigger mode by writing "trig_mode". This filter can
   only be used in bandwidth events.

   For example, "trig_len=4" means trigger condition is 2^4 DW, "trig_mode=0"
   means statistics start when TLP length > trigger condition, "trig_mode=1"
@@ -90,9 +95,10 @@ means start when TLP length < condition.

   Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
     $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5

3. Threshold filter

   Counter counts when TLP length within the specified range. You can set the
   threshold by writing "thr_len", and set the threshold mode by writing
   "thr_mode". This filter can only be used in bandwidth events.
@@ -103,4 +109,22 @@ when TLP length < threshold.

   Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
     $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5

4. TLP Length filter

   When counting bandwidth, the data can be composed of certain parts of TLP
   packets. You can specify it through "len_mode":

   - 2'b00: Reserved (Do not use this since the behaviour is undefined)
   - 2'b01: Bandwidth of TLP payloads
   - 2'b10: Bandwidth of TLP headers
   - 2'b11: Bandwidth of both TLP payloads and headers

   For example, "len_mode=2" means only counting the bandwidth of TLP headers
   and "len_mode=3" means the final bandwidth data is composed of both TLP
   headers and payloads. Default value if not specified is 2'b11.

   Example usage of perf::

     $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5
+2 −0
Original line number Diff line number Diff line
@@ -19,3 +19,5 @@ Performance monitor support
   arm_dsu_pmu
   thunderx2-pmu
   alibaba_pmu
   nvidia-pmu
   meson-ddr-pmu
+70 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0

===========================================================
Amlogic SoC DDR Bandwidth Performance Monitoring Unit (PMU)
===========================================================

The Amlogic Meson G12 SoC contains a bandwidth monitor inside DRAM controller.
The monitor includes 4 channels. Each channel can count the request accessing
DRAM. The channel can count up to 3 AXI port simultaneously. It can be helpful
to show if the performance bottleneck is on DDR bandwidth.

Currently, this driver supports the following 5 perf events:

+ meson_ddr_bw/total_rw_bytes/
+ meson_ddr_bw/chan_1_rw_bytes/
+ meson_ddr_bw/chan_2_rw_bytes/
+ meson_ddr_bw/chan_3_rw_bytes/
+ meson_ddr_bw/chan_4_rw_bytes/

meson_ddr_bw/chan_{1,2,3,4}_rw_bytes/ events are channel-specific events.
Each channel support filtering, which can let the channel to monitor
individual IP module in SoC.

Below are DDR access request event filter keywords:

+ arm             - from CPU
+ vpu_read1       - from OSD + VPP read
+ gpu             - from 3D GPU
+ pcie            - from PCIe controller
+ hdcp            - from HDCP controller
+ hevc_front      - from HEVC codec front end
+ usb3_0          - from USB3.0 controller
+ hevc_back       - from HEVC codec back end
+ h265enc         - from HEVC encoder
+ vpu_read2       - from DI read
+ vpu_write1      - from VDIN write
+ vpu_write2      - from di write
+ vdec            - from legacy codec video decoder
+ hcodec          - from H264 encoder
+ ge2d            - from ge2d
+ spicc1          - from SPI controller 1
+ usb0            - from USB2.0 controller 0
+ dma             - from system DMA controller 1
+ arb0            - from arb0
+ sd_emmc_b       - from SD eMMC b controller
+ usb1            - from USB2.0 controller 1
+ audio           - from Audio module
+ sd_emmc_c       - from SD eMMC c controller
+ spicc2          - from SPI controller 2
+ ethernet        - from Ethernet controller


Examples:

  + Show the total DDR bandwidth per seconds:

    .. code-block:: bash

       perf stat -a -e meson_ddr_bw/total_rw_bytes/ -I 1000 sleep 10


  + Show individual DDR bandwidth from CPU and GPU respectively, as well as
    sum of them:

    .. code-block:: bash

       perf stat -a -e meson_ddr_bw/chan_1_rw_bytes,arm=1/ -I 1000 sleep 10
       perf stat -a -e meson_ddr_bw/chan_2_rw_bytes,gpu=1/ -I 1000 sleep 10
       perf stat -a -e meson_ddr_bw/chan_3_rw_bytes,arm=1,gpu=1/ -I 1000 sleep 10
+299 −0
Original line number Diff line number Diff line
=========================================================
NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
=========================================================

The NVIDIA Tegra SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Scalable Coherency Fabric (SCF)
* NVLink-C2C0
* NVLink-C2C1
* CNVLink
* PCIE

PMU Driver
----------

The PMUs in this document are based on ARM CoreSight PMU Architecture as
described in document: ARM IHI 0091. Since this is a standard architecture, the
PMUs are managed by a common driver "arm-cs-arch-pmu". This driver describes
the available events and configuration of each PMU in sysfs. Please see the
sections below to get the sysfs path of each PMU. Like other uncore PMU drivers,
the driver provides "cpumask" sysfs attribute to show the CPU id used to handle
the PMU event. There is also "associated_cpus" sysfs attribute, which contains a
list of CPUs associated with the PMU instance.

.. _SCF_PMU_Section:

SCF PMU
-------

The SCF PMU monitors system level cache events, CPU traffic, and
strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>.

Example usage:

* Count event id 0x0 in socket 0::

   perf stat -a -e nvidia_scf_pmu_0/event=0x0/

* Count event id 0x0 in socket 1::

   perf stat -a -e nvidia_scf_pmu_1/event=0x0/

NVLink-C2C0 PMU
--------------------

The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with
NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU
varies dependent on the chip configuration:

* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC.

  In this config, the PMU captures GPU ATS translated or EGM traffic from the GPU.

* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.

  In this config, the PMU captures read and relaxed ordered (RO) writes from
  PCIE device of the remote SoC.

Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU/CPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/

* Count event id 0x0 from the GPU/CPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/

NVLink-C2C1 PMU
-------------------

The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with
NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
traffic, in contrast with NvLink-C2C0 PMU that captures ATS translated traffic.
Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.

Example usage:

* Count event id 0x0 from the GPU connected with socket 0::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/

* Count event id 0x0 from the GPU connected with socket 1::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/

* Count event id 0x0 from the GPU connected with socket 2::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/

* Count event id 0x0 from the GPU connected with socket 3::

   perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/

CNVLink PMU
---------------

The CNVLink PMU monitors traffic from GPU and PCIE device on remote sockets
to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>.

Each SoC socket can be connected to one or more sockets via CNVLink. The user can
use "rem_socket" bitmap parameter to select the remote socket(s) to monitor.
Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to
socket 1 to 3.
/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
shows the valid bits that can be set in the "rem_socket" parameter.

The PMU can not distinguish the remote traffic initiator, therefore it does not
provide filter to select the traffic source to monitor. It reports combined
traffic from remote GPU and PCIE devices.

Example usage:

* Count event id 0x0 for the traffic from remote socket 1, 2, and 3 to socket 0::

   perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/

* Count event id 0x0 for the traffic from remote socket 0, 2, and 3 to socket 1::

   perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/

* Count event id 0x0 for the traffic from remote socket 0, 1, and 3 to socket 2::

   perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/

* Count event id 0x0 for the traffic from remote socket 0, 1, and 2 to socket 3::

   perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/


PCIE PMU
------------

The PCIE PMU monitors all read/write traffic from PCIE root ports to
local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.

The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>.

Each SoC socket can support multiple root ports. The user can use
"root_port" bitmap parameter to select the port(s) to monitor, i.e.
"root_port=0xF" corresponds to root port 0 to 3.
/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
shows the valid bits that can be set in the "root_port" parameter.

Example usage:

* Count event id 0x0 from root port 0 and 1 of socket 0::

   perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/

* Count event id 0x0 from root port 0 and 1 of socket 1::

   perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/

.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:

Traffic Coverage
----------------

The PMU traffic coverage may vary dependent on the chip configuration:

* **NVIDIA Grace Hopper Superchip**: Hopper GPU is connected with Grace SoC.

  Example configuration with two Grace SoCs::

   *********************************          *********************************
   * SOCKET-A                      *          * SOCKET-B                      *
   *                               *          *                               *
   *                     ::::::::  *          *  ::::::::                     *
   *                     : PCIE :  *          *  : PCIE :                     *
   *                     ::::::::  *          *  ::::::::                     *
   *                         |     *          *      |                        *
   *                         |     *          *      |                        *
   *  :::::::            ::::::::: *          *  :::::::::            ::::::: *
   *  :     :            :       : *          *  :       :            :     : *
   *  : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
   *  :     :    C2C     :  SoC  : *          *  :  SoC  :    C2C     :     : *
   *  :::::::            ::::::::: *          *  :::::::::            ::::::: *
   *     |                   |     *          *      |                   |    *
   *     |                   |     *          *      |                   |    *
   *  &&&&&&&&           &&&&&&&&  *          *   &&&&&&&&           &&&&&&&& *
   *  & GMEM &           & CMEM &  *          *   & CMEM &           & GMEM & *
   *  &&&&&&&&           &&&&&&&&  *          *   &&&&&&&&           &&&&&&&& *
   *                               *          *                               *
   *********************************          *********************************

   GMEM = GPU Memory (e.g. HBM)
   CMEM = CPU Memory (e.g. LPDDR5X)

  |
  | Following table contains traffic coverage of Grace SoC PMU in socket-A:

  ::

   +--------------+-------+-----------+-----------+-----+----------+----------+
   |              |                        Source                             |
   +              +-------+-----------+-----------+-----+----------+----------+
   | Destination  |       |GPU ATS    |GPU Not-ATS|     | Socket-B | Socket-B |
   |              |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
   |              |       |EGM        |           |     |          |          |
   +==============+=======+===========+===========+=====+==========+==========+
   | Local        | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Local GMEM   | PCIE  |    N/A    |NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
   |              | PMU   |           |PMU        | PMU |          | PMU      |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote       | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
   | over CNVLink |       |           |           |     |          |          |
   +--------------+-------+-----------+-----------+-----+----------+----------+
   | Remote GMEM  | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
   | over CNVLink | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
   +--------------+-------+-----------+-----------+-----+----------+----------+

   PCIE1 traffic represents strongly ordered (SO) writes.
   PCIE2 traffic represents reads and relaxed ordered (RO) writes.

* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected.

  Example configuration with two Grace SoCs::

   *******************             *******************
   * SOCKET-A        *             * SOCKET-B        *
   *                 *             *                 *
   *    ::::::::     *             *    ::::::::     *
   *    : PCIE :     *             *    : PCIE :     *
   *    ::::::::     *             *    ::::::::     *
   *        |        *             *        |        *
   *        |        *             *        |        *
   *    :::::::::    *             *    :::::::::    *
   *    :       :    *             *    :       :    *
   *    : Grace :<--------NVLink------->: Grace :    *
   *    :  SoC  :    *     C2C     *    :  SoC  :    *
   *    :::::::::    *             *    :::::::::    *
   *        |        *             *        |        *
   *        |        *             *        |        *
   *     &&&&&&&&    *             *     &&&&&&&&    *
   *     & CMEM &    *             *     & CMEM &    *
   *     &&&&&&&&    *             *     &&&&&&&&    *
   *                 *             *                 *
   *******************             *******************

   GMEM = GPU Memory (e.g. HBM)
   CMEM = CPU Memory (e.g. LPDDR5X)

  |
  | Following table contains traffic coverage of Grace SoC PMU in socket-A:

  ::

   +-----------------+-----------+---------+----------+-------------+
   |                 |                      Source                  |
   +                 +-----------+---------+----------+-------------+
   | Destination     |           |         | Socket-B | Socket-B    |
   |                 |  PCI R/W  |   CPU   | CPU/PCIE1| PCIE2       |
   |                 |           |         |          |             |
   +=================+===========+=========+==========+=============+
   | Local           |  PCIE PMU | SCF PMU | SCF PMU  | NVLink-C2C0 |
   | SYSRAM/CMEM     |           |         |          | PMU         |
   +-----------------+-----------+---------+----------+-------------+
   | Remote          |           |         |          |             |
   | SYSRAM/CMEM     |  PCIE PMU | SCF PMU |   N/A    |     N/A     |
   | over NVLink-C2C |           |         |          |             |
   +-----------------+-----------+---------+----------+-------------+

   PCIE1 traffic represents strongly ordered (SO) writes.
   PCIE2 traffic represents reads and relaxed ordered (RO) writes.
Loading