Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux (06cff4a5)

Documentation/admin-guide/kernel-parameters.txt

+6 −9

		@@ -831,7 +831,7 @@
		memory region [offset, offset + size] for that kernel
		image. If '@offset' is omitted, then a suitable offset
		is selected automatically.
		[KNL, X86-64] Select a region under 4G first, and
		[KNL, X86-64, ARM64] Select a region under 4G first, and
		fall back to reserve region above 4G when '@offset'
		hasn't been specified.
		See Documentation/admin-guide/kdump/kdump.rst for further details.
		@@ -851,26 +851,23 @@
		available.
		It will be ignored if crashkernel=X is specified.
		crashkernel=size[KMG],low
		[KNL, X86-64] range under 4G. When crashkernel=X,high
		[KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
		is passed, kernel could allocate physical memory region
		above 4G, that cause second kernel crash on system
		that require some amount of low memory, e.g. swiotlb
		requires at least 64M+32K low memory, also enough extra
		low memory is needed to make sure DMA buffers for 32-bit
		devices won't run out. Kernel would try to allocate
		at least 256M below 4G automatically.
		default size of memory below 4G automatically. The default
		size is platform dependent.
		--> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
		--> arm64: 128MiB
		This one lets the user specify own low range under 4G
		for second kernel instead.
		0: to disable low allocation.
		It will be ignored when crashkernel=X,high is not used
		or memory reserved is below 4G.

		[KNL, ARM64] range in low memory.
		This one lets the user specify a low range in the
		DMA zone for the crash dump kernel.
		It will be ignored when crashkernel=X,high is not used
		or memory reserved is located in the DMA zones.

		cryptomgr.notests
		[KNL] Disable crypto self-tests
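Review note: the default low-memory sizing rule quoted in the crashkernel=size[KMG],low hunk can be sketched as below. This is only an illustration of the documented formula. The function name `default_crash_low_size` is hypothetical, and the 64 MiB stand-in for `swiotlb_size_or_default()` is an assumption, not a value taken from this diff.

```python
MIB = 1024 * 1024

# Assumption: placeholder for what swiotlb_size_or_default() would return.
ASSUMED_SWIOTLB_SIZE = 64 * MIB


def default_crash_low_size(arch: str, swiotlb_size: int = ASSUMED_SWIOTLB_SIZE) -> int:
    """Return the documented default low reservation in bytes (hypothetical helper)."""
    if arch == "x86":
        # x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
        return max(swiotlb_size + 8 * MIB, 256 * MIB)
    if arch == "arm64":
        # arm64: fixed 128MiB
        return 128 * MIB
    raise ValueError(f"no documented default for architecture {arch!r}")


if __name__ == "__main__":
    for arch in ("x86", "arm64"):
        print(arch, default_crash_low_size(arch) // MIB, "MiB")
```

With the assumed 64 MiB swiotlb size the x86 branch falls back to the 256 MiB floor; the swiotlb term only matters once it exceeds 248 MiB.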

Documentation/admin-guide/perf/hisi-pcie-pmu.rst

+68 −44

		@@ -15,10 +15,10 @@ HiSilicon PCIe PMU driver
		The PCIe PMU driver registers a perf PMU with the name of its sicl-id and PCIe
		Core id.::

		/sys/bus/event_source/hisi_pcie<sicl>_<core>
		/sys/bus/event_source/hisi_pcie<sicl>_core<core>

		PMU driver provides description of available events and filter options in sysfs,
		see /sys/bus/event_source/devices/hisi_pcie<sicl>_<core>.
		see /sys/bus/event_source/devices/hisi_pcie<sicl>_core<core>.

		The "format" directory describes all formats of the config (events) and config1
		(filter options) fields of the perf_event_attr structure. The "events" directory
		@@ -33,13 +33,13 @@ monitored by PMU.
		Example usage of perf::

		$# perf list
		hisi_pcie0_0/rx_mwr_latency/ [kernel PMU event]
		hisi_pcie0_0/rx_mwr_cnt/ [kernel PMU event]
		hisi_pcie0_core0/rx_mwr_latency/ [kernel PMU event]
		hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event]
		------------------------------------------

		$# perf stat -e hisi_pcie0_0/rx_mwr_latency/
		$# perf stat -e hisi_pcie0_0/rx_mwr_cnt/
		$# perf stat -g -e hisi_pcie0_0/rx_mwr_latency/ -e hisi_pcie0_0/rx_mwr_cnt/
		$# perf stat -e hisi_pcie0_core0/rx_mwr_latency/
		$# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/
		$# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/

		The current driver does not support sampling. So "perf record" is unsupported.
		Also attach to a task is unsupported for PCIe PMU.
		@@ -48,41 +48,46 @@ Filter options
		--------------

		1. Target filter
		PMU could only monitor the performance of traffic downstream target Root Ports
		or downstream target Endpoint. PCIe PMU driver support "port" and "bdf"
		interfaces for users, and these two interfaces aren't supported at the same
		time.

		PMU could only monitor the performance of traffic downstream target Root
		Ports or downstream target Endpoint. PCIe PMU driver support "port" and
		"bdf" interfaces for users, and these two interfaces aren't supported at the
		same time.

		- port

		"port" filter can be used in all PCIe PMU events, target Root Port can be
		selected by configuring the 16-bits-bitmap "port". Multi ports can be selected
		for AP-layer-events, and only one port can be selected for TL/DL-layer-events.
		selected by configuring the 16-bits-bitmap "port". Multi ports can be
		selected for AP-layer-events, and only one port can be selected for
		TL/DL-layer-events.

		For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of bitmap
		should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4 lanes),
		bit8 is set, port=0x100; if these two Root Ports are both monitored, port=0x101.
		For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of
		bitmap should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4
		lanes), bit8 is set, port=0x100; if these two Root Ports are both
		monitored, port=0x101.

		Example usage of perf::

		$# perf stat -e hisi_pcie0_0/rx_mwr_latency,port=0x1/ sleep 5
		$# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0x1/ sleep 5
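Review note: the "port" value is a 16-bit bitmap, so the three values in the text (0x1, 0x100, 0x101) follow from OR-ing the selected bit positions. A minimal sketch, where the helper name `port_bitmap` is hypothetical and the mapping from a Root Port address to its bit position is taken as given rather than derived:

```python
def port_bitmap(bit_positions):
    """Combine Root Port bit positions (0..15) into a 16-bit 'port' value."""
    mask = 0
    for bit in bit_positions:
        if not 0 <= bit < 16:
            raise ValueError(f"bit position {bit} is outside the 16-bit bitmap")
        mask |= 1 << bit
    return mask


if __name__ == "__main__":
    # bit0 and bit8 are the positions named in the documentation example.
    print(hex(port_bitmap([0])))
    print(hex(port_bitmap([8])))
    print(hex(port_bitmap([0, 8])))
```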

		- bdf

		"bdf" filter can only be used in bandwidth events, target Endpoint is selected
		by configuring BDF to "bdf". Counter only counts the bandwidth of message
		requested by target Endpoint.
		"bdf" filter can only be used in bandwidth events, target Endpoint is
		selected by configuring BDF to "bdf". Counter only counts the bandwidth of
		message requested by target Endpoint.

		For example, "bdf=0x3900" means BDF of target Endpoint is 0000:39:00.0.

		Example usage of perf::

		$# perf stat -e hisi_pcie0_0/rx_mrd_flux,bdf=0x3900/ sleep 5
		$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,bdf=0x3900/ sleep 5
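Review note: the example "bdf=0x3900" for 0000:39:00.0 is consistent with the usual 16-bit PCI encoding (bus in bits 15:8, device in bits 7:3, function in bits 2:0). The sketch below assumes that encoding applies to this filter; the documentation only confirms it for the single example. The helper name `bdf_filter_value` is hypothetical.

```python
def bdf_filter_value(address: str) -> int:
    """Convert '[domain:]bus:dev.fn' (hex fields) into a 16-bit BDF value.

    The domain, if present, is ignored because it is not part of the value.
    """
    fields = address.split(":")
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected PCI address format: {address!r}")
    bus = int(fields[-2], 16)
    dev_text, _, fn_text = fields[-1].partition(".")
    device = int(dev_text, 16)
    function = int(fn_text, 16)
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError(f"field out of range in {address!r}")
    # Assumed layout: bus[15:8] | device[7:3] | function[2:0]
    return (bus << 8) | (device << 3) | function


if __name__ == "__main__":
    print(hex(bdf_filter_value("0000:39:00.0")))
```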

		2. Trigger filter

		Event statistics start when the first time TLP length is greater/smaller
		than trigger condition. You can set the trigger condition by writing "trig_len",
		and set the trigger mode by writing "trig_mode". This filter can only be used
		in bandwidth events.
		than trigger condition. You can set the trigger condition by writing
		"trig_len", and set the trigger mode by writing "trig_mode". This filter can
		only be used in bandwidth events.

		For example, "trig_len=4" means trigger condition is 2^4 DW, "trig_mode=0"
		means statistics start when TLP length > trigger condition, "trig_mode=1"
		@@ -90,9 +95,10 @@ means start when TLP length < condition.

		Example usage of perf::

		$# perf stat -e hisi_pcie0_0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
		$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
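Review note: "trig_len" is an exponent, so trig_len=4 means a trigger condition of 2^4 = 16 DW (64 bytes, since a PCIe DW is 4 bytes). The sketch below models the two documented trigger modes; the function names are hypothetical and this is not driver code.

```python
DW_BYTES = 4  # a PCIe doubleword is 32 bits


def trigger_condition_dw(trig_len: int) -> int:
    """Trigger condition in DW for a given trig_len (2^trig_len)."""
    if trig_len < 0:
        raise ValueError("trig_len must be non-negative")
    return 1 << trig_len


def trigger_fires(tlp_len_dw: int, trig_len: int, trig_mode: int) -> bool:
    """Return True if a TLP of this length would start the statistics."""
    condition = trigger_condition_dw(trig_len)
    if trig_mode == 0:
        return tlp_len_dw > condition  # start when TLP length > condition
    if trig_mode == 1:
        return tlp_len_dw < condition  # start when TLP length < condition
    raise ValueError("trig_mode must be 0 or 1")


if __name__ == "__main__":
    dw = trigger_condition_dw(4)
    print(dw, "DW =", dw * DW_BYTES, "bytes")
```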

		3. Threshold filter

		Counter counts when TLP length within the specified range. You can set the
		threshold by writing "thr_len", and set the threshold mode by writing
		"thr_mode". This filter can only be used in bandwidth events.
		@@ -103,4 +109,22 @@ when TLP length < threshold.

		Example usage of perf::

		$# perf stat -e hisi_pcie0_0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
		$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5

		4. TLP Length filter

		When counting bandwidth, the data can be composed of certain parts of TLP
		packets. You can specify it through "len_mode":

		- 2'b00: Reserved (Do not use this since the behaviour is undefined)
		- 2'b01: Bandwidth of TLP payloads
		- 2'b10: Bandwidth of TLP headers
		- 2'b11: Bandwidth of both TLP payloads and headers

		For example, "len_mode=2" means only counting the bandwidth of TLP headers
		and "len_mode=3" means the final bandwidth data is composed of both TLP
		headers and payloads. Default value if not specified is 2'b11.

		Example usage of perf::

		$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5
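Review note: the "len_mode" encodings listed in the new text can be captured in a small enum so that the reserved value 2'b00 is rejected up front. This is a sketch only; `LenMode` and `event_spec` are hypothetical names, and the generated string simply mirrors the perf example above.

```python
from enum import IntEnum


class LenMode(IntEnum):
    """Documented len_mode values; 2'b00 is reserved and deliberately absent."""
    PAYLOAD = 0b01  # bandwidth of TLP payloads
    HEADER = 0b10   # bandwidth of TLP headers
    BOTH = 0b11     # payloads and headers (documented default)


def event_spec(pmu: str, event: str, len_mode: LenMode = LenMode.BOTH) -> str:
    """Build a perf event string with an explicit len_mode filter."""
    return f"{pmu}/{event},len_mode={int(len_mode):#x}/"


if __name__ == "__main__":
    print(event_spec("hisi_pcie0_core0", "rx_mrd_flux", LenMode.PAYLOAD))
```

Constructing `LenMode(0)` raises `ValueError`, which keeps the undefined reserved mode out of generated command lines.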

Documentation/admin-guide/perf/index.rst

+2 −0

		@@ -19,3 +19,5 @@ Performance monitor support
		arm_dsu_pmu
		thunderx2-pmu
		alibaba_pmu
		nvidia-pmu
		meson-ddr-pmu

Documentation/admin-guide/perf/meson-ddr-pmu.rst

0 → 100644

+70 −0

		.. SPDX-License-Identifier: GPL-2.0

		===========================================================
		Amlogic SoC DDR Bandwidth Performance Monitoring Unit (PMU)
		===========================================================

		The Amlogic Meson G12 SoC contains a bandwidth monitor inside DRAM controller.
		The monitor includes 4 channels. Each channel can count the request accessing
		DRAM. The channel can count up to 3 AXI port simultaneously. It can be helpful
		to show if the performance bottleneck is on DDR bandwidth.

		Currently, this driver supports the following 5 perf events:

		+ meson_ddr_bw/total_rw_bytes/
		+ meson_ddr_bw/chan_1_rw_bytes/
		+ meson_ddr_bw/chan_2_rw_bytes/
		+ meson_ddr_bw/chan_3_rw_bytes/
		+ meson_ddr_bw/chan_4_rw_bytes/

		meson_ddr_bw/chan_{1,2,3,4}_rw_bytes/ events are channel-specific events.
		Each channel support filtering, which can let the channel to monitor
		individual IP module in SoC.

		Below are DDR access request event filter keywords:

		+ arm - from CPU
		+ vpu_read1 - from OSD + VPP read
		+ gpu - from 3D GPU
		+ pcie - from PCIe controller
		+ hdcp - from HDCP controller
		+ hevc_front - from HEVC codec front end
		+ usb3_0 - from USB3.0 controller
		+ hevc_back - from HEVC codec back end
		+ h265enc - from HEVC encoder
		+ vpu_read2 - from DI read
		+ vpu_write1 - from VDIN write
		+ vpu_write2 - from di write
		+ vdec - from legacy codec video decoder
		+ hcodec - from H264 encoder
		+ ge2d - from ge2d
		+ spicc1 - from SPI controller 1
		+ usb0 - from USB2.0 controller 0
		+ dma - from system DMA controller 1
		+ arb0 - from arb0
		+ sd_emmc_b - from SD eMMC b controller
		+ usb1 - from USB2.0 controller 1
		+ audio - from Audio module
		+ sd_emmc_c - from SD eMMC c controller
		+ spicc2 - from SPI controller 2
		+ ethernet - from Ethernet controller


		Examples:

		+ Show the total DDR bandwidth per seconds:

		.. code-block:: bash

		perf stat -a -e meson_ddr_bw/total_rw_bytes/ -I 1000 sleep 10


		+ Show individual DDR bandwidth from CPU and GPU respectively, as well as
		sum of them:

		.. code-block:: bash

		perf stat -a -e meson_ddr_bw/chan_1_rw_bytes,arm=1/ -I 1000 sleep 10
		perf stat -a -e meson_ddr_bw/chan_2_rw_bytes,gpu=1/ -I 1000 sleep 10
		perf stat -a -e meson_ddr_bw/chan_3_rw_bytes,arm=1,gpu=1/ -I 1000 sleep 10
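Review note: the channel events take filter keywords as `keyword=1` terms, as in the examples above. A small sketch of building such an event string follows. The helper name `meson_channel_event` is hypothetical, and the limit of three filters per channel assumes that each keyword corresponds to one AXI port, which the text implies but does not state outright.

```python
MAX_FILTERS_PER_CHANNEL = 3  # assumption: one filter keyword per AXI port


def meson_channel_event(channel: int, filters) -> str:
    """Build a meson_ddr_bw channel event string with filter keywords."""
    if channel not in (1, 2, 3, 4):
        raise ValueError("the monitor has channels 1 to 4")
    filters = list(filters)
    if len(filters) > MAX_FILTERS_PER_CHANNEL:
        raise ValueError("a channel can count at most 3 AXI ports at once")
    terms = [f"chan_{channel}_rw_bytes"] + [f"{name}=1" for name in filters]
    return "meson_ddr_bw/" + ",".join(terms) + "/"


if __name__ == "__main__":
    print(meson_channel_event(3, ["arm", "gpu"]))
```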

Documentation/admin-guide/perf/nvidia-pmu.rst

0 → 100644

+299 −0

		=========================================================
		NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
		=========================================================

		The NVIDIA Tegra SoC includes various system PMUs to measure key performance
		metrics like memory bandwidth, latency, and utilization:

		* Scalable Coherency Fabric (SCF)
		* NVLink-C2C0
		* NVLink-C2C1
		* CNVLink
		* PCIE

		PMU Driver
		----------

		The PMUs in this document are based on ARM CoreSight PMU Architecture as
		described in document: ARM IHI 0091. Since this is a standard architecture, the
		PMUs are managed by a common driver "arm-cs-arch-pmu". This driver describes
		the available events and configuration of each PMU in sysfs. Please see the
		sections below to get the sysfs path of each PMU. Like other uncore PMU drivers,
		the driver provides "cpumask" sysfs attribute to show the CPU id used to handle
		the PMU event. There is also "associated_cpus" sysfs attribute, which contains a
		list of CPUs associated with the PMU instance.

		.. _SCF_PMU_Section:

		SCF PMU
		-------

		The SCF PMU monitors system level cache events, CPU traffic, and
		strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
		:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
		traffic coverage.

		The events and configuration options of this PMU device are described in sysfs,
		see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>.

		Example usage:

		* Count event id 0x0 in socket 0::

		perf stat -a -e nvidia_scf_pmu_0/event=0x0/

		* Count event id 0x0 in socket 1::

		perf stat -a -e nvidia_scf_pmu_1/event=0x0/

		NVLink-C2C0 PMU
		--------------------

		The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with
		NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU
		varies dependent on the chip configuration:

		* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC.

		In this config, the PMU captures GPU ATS translated or EGM traffic from the GPU.

		* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.

		In this config, the PMU captures read and relaxed ordered (RO) writes from
		PCIE device of the remote SoC.

		Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
		the PMU traffic coverage.

		The events and configuration options of this PMU device are described in sysfs,
		see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.

		Example usage:

		* Count event id 0x0 from the GPU/CPU connected with socket 0::

		perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/

		* Count event id 0x0 from the GPU/CPU connected with socket 1::

		perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/

		* Count event id 0x0 from the GPU/CPU connected with socket 2::

		perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/

		* Count event id 0x0 from the GPU/CPU connected with socket 3::

		perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/

		NVLink-C2C1 PMU
		-------------------

		The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with
		NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
		traffic, in contrast with NvLink-C2C0 PMU that captures ATS translated traffic.
		Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
		the PMU traffic coverage.

		The events and configuration options of this PMU device are described in sysfs,
		see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.

		Example usage:

		* Count event id 0x0 from the GPU connected with socket 0::

		perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/

		* Count event id 0x0 from the GPU connected with socket 1::

		perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/

		* Count event id 0x0 from the GPU connected with socket 2::

		perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/

		* Count event id 0x0 from the GPU connected with socket 3::

		perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/

		CNVLink PMU
		---------------

		The CNVLink PMU monitors traffic from GPU and PCIE device on remote sockets
		to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
		(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
		for more info about the PMU traffic coverage.

		The events and configuration options of this PMU device are described in sysfs,
		see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>.

		Each SoC socket can be connected to one or more sockets via CNVLink. The user can
		use "rem_socket" bitmap parameter to select the remote socket(s) to monitor.
		Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to
		socket 1 to 3.
		/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
		shows the valid bits that can be set in the "rem_socket" parameter.

		The PMU can not distinguish the remote traffic initiator, therefore it does not
		provide filter to select the traffic source to monitor. It reports combined
		traffic from remote GPU and PCIE devices.

		Example usage:

		* Count event id 0x0 for the traffic from remote socket 1, 2, and 3 to socket 0::

		perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/

		* Count event id 0x0 for the traffic from remote socket 0, 2, and 3 to socket 1::

		perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/

		* Count event id 0x0 for the traffic from remote socket 0, 1, and 3 to socket 2::

		perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/

		* Count event id 0x0 for the traffic from remote socket 0, 1, and 2 to socket 3::

		perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/
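Review note: the four "rem_socket" values in the examples above (0xE, 0xD, 0xB, 0x7) are each "every socket except the local one" in a four-socket system. A minimal sketch of that calculation, with a hypothetical helper name and the socket count of four taken from the examples rather than from the hardware:

```python
def rem_socket_mask(local_socket: int, num_sockets: int = 4) -> int:
    """Bitmap selecting every socket except the local one."""
    if not 0 <= local_socket < num_sockets:
        raise ValueError("local_socket is outside the socket range")
    mask = 0
    for socket in range(num_sockets):
        if socket != local_socket:
            mask |= 1 << socket  # one bit per remote socket number
    return mask


if __name__ == "__main__":
    for local in range(4):
        print(f"nvidia_cnvlink_pmu_{local}: rem_socket={rem_socket_mask(local):#X}")
```

The valid bits on a real system should still be read from the format/rem_socket sysfs file mentioned in the text.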


		PCIE PMU
		------------

		The PCIE PMU monitors all read/write traffic from PCIE root ports to
		local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
		for more info about the PMU traffic coverage.

		The events and configuration options of this PMU device are described in sysfs,
		see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>.

		Each SoC socket can support multiple root ports. The user can use
		"root_port" bitmap parameter to select the port(s) to monitor, i.e.
		"root_port=0xF" corresponds to root port 0 to 3.
		/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
		shows the valid bits that can be set in the "root_port" parameter.

		Example usage:

		* Count event id 0x0 from root port 0 and 1 of socket 0::

		perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/

		* Count event id 0x0 from root port 0 and 1 of socket 1::

		perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/
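Review note: going the other way, a "root_port" bitmap can be decoded into the list of ports it selects, which is a quick check that 0xF means ports 0 to 3 and 0x3 means ports 0 and 1. The helper name `selected_root_ports` is hypothetical.

```python
def selected_root_ports(mask: int):
    """Return the root port numbers whose bits are set in the bitmap."""
    if mask < 0:
        raise ValueError("bitmap must be non-negative")
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    print(selected_root_ports(0xF))
    print(selected_root_ports(0x3))
```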

		.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:

		Traffic Coverage
		----------------

		The PMU traffic coverage may vary dependent on the chip configuration:

		* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC.

		Example configuration with two Grace SoCs::

		*********************************         *********************************
		* SOCKET-A                      *         * SOCKET-B                      *
		*                               *         *                               *
		*                    ::::::::   *         *   ::::::::                    *
		*                    : PCIE :   *         *   : PCIE :                    *
		*                    ::::::::   *         *   ::::::::                    *
		*                        |      *         *      |                        *
		*                        |      *         *      |                        *
		* :::::::            :::::::::  *         *  :::::::::            ::::::: *
		* :     :            :       :  *         *  :       :            :     : *
		* : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
		* :     :    C2C     :  SoC  :  *         *  :  SoC  :    C2C     :     : *
		* :::::::            :::::::::  *         *  :::::::::            ::::::: *
		*    |                   |      *         *      |                   |    *
		*    |                   |      *         *      |                   |    *
		* &&&&&&&&           &&&&&&&&   *         *   &&&&&&&&           &&&&&&&& *
		* & GMEM &           & CMEM &   *         *   & CMEM &           & GMEM & *
		* &&&&&&&&           &&&&&&&&   *         *   &&&&&&&&           &&&&&&&& *
		*                               *         *                               *
		*********************************         *********************************

		GMEM = GPU Memory (e.g. HBM)
		CMEM = CPU Memory (e.g. LPDDR5X)

		|
		| Following table contains traffic coverage of Grace SoC PMU in socket-A:

		::

		+--------------+-------+-----------+-----------+-----+----------+----------+
		|              |                        Source                             |
		+              +-------+-----------+-----------+-----+----------+----------+
		| Destination  |       |GPU ATS    |GPU Not-ATS|     | Socket-B | Socket-B |
		|              |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
		|              |       |EGM        |           |     |          |          |
		+==============+=======+===========+===========+=====+==========+==========+
		| Local        | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
		| SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |          | PMU      |
		+--------------+-------+-----------+-----------+-----+----------+----------+
		| Local GMEM   | PCIE  |    N/A    |NVLink-C2C1| SCF | SCF PMU  | CNVLink  |
		|              | PMU   |           |PMU        | PMU |          | PMU      |
		+--------------+-------+-----------+-----------+-----+----------+----------+
		| Remote       | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
		| SYSRAM/CMEM  | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
		| over CNVLink |       |           |           |     |          |          |
		+--------------+-------+-----------+-----------+-----+----------+----------+
		| Remote GMEM  | PCIE  |NVLink-C2C0|NVLink-C2C1| SCF |          |          |
		| over CNVLink | PMU   |PMU        |PMU        | PMU |   N/A    |   N/A    |
		+--------------+-------+-----------+-----------+-----+----------+----------+

		PCIE1 traffic represents strongly ordered (SO) writes.
		PCIE2 traffic represents reads and relaxed ordered (RO) writes.

		* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.

		Example configuration with two Grace SoCs::

		*****************           *****************
		* SOCKET-A      *           * SOCKET-B      *
		*               *           *               *
		* ::::::::      *           *      :::::::: *
		* : PCIE :      *           *      : PCIE : *
		* ::::::::      *           *      :::::::: *
		*     |         *           *         |     *
		*     |         *           *         |     *
		* :::::::::     *           *     ::::::::: *
		* :       :     *           *     :       : *
		* : Grace :<--------NVLink------->: Grace : *
		* :  SoC  :     *    C2C    *     :  SoC  : *
		* :::::::::     *           *     ::::::::: *
		*     |         *           *         |     *
		*     |         *           *         |     *
		* &&&&&&&&      *           *      &&&&&&&& *
		* & CMEM &      *           *      & CMEM & *
		* &&&&&&&&      *           *      &&&&&&&& *
		*               *           *               *
		*****************           *****************

		GMEM = GPU Memory (e.g. HBM)
		CMEM = CPU Memory (e.g. LPDDR5X)

		|
		| Following table contains traffic coverage of Grace SoC PMU in socket-A:

		::

		+-----------------+-----------+---------+----------+-------------+
		|                 |                    Source                    |
		+                 +-----------+---------+----------+-------------+
		| Destination     |           |         | Socket-B | Socket-B    |
		|                 |  PCI R/W  |   CPU   | CPU/PCIE1| PCIE2       |
		|                 |           |         |          |             |
		+=================+===========+=========+==========+=============+
		| Local           |  PCIE PMU | SCF PMU | SCF PMU  | NVLink-C2C0 |
		| SYSRAM/CMEM     |           |         |          | PMU         |
		+-----------------+-----------+---------+----------+-------------+
		| Remote          |           |         |          |             |
		| SYSRAM/CMEM     |  PCIE PMU | SCF PMU |   N/A    |     N/A     |
		| over NVLink-C2C |           |         |          |             |
		+-----------------+-----------+---------+----------+-------------+

		PCIE1 traffic represents strongly ordered (SO) writes.
		PCIE2 traffic represents reads and relaxed ordered (RO) writes.