Commit c5c4c6fa, authored by Olaf Weber, committed by Greg Kroah-Hartman

staging/lustre/ptlrpc: make ptlrpcd threads cpt-aware



On NUMA systems, the placement of worker threads relative to the
memory they use greatly affects performance. The CPT mechanism can be
used to constrain a number of Lustre thread types, and this change
makes it possible to configure the placement of ptlrpcd threads in a
similar manner.

To simplify the code changes, the global structures used to manage
ptlrpcd threads are changed to one per CPT. In particular this means
there will be one ptlrpcd recovery thread per CPT.

To prevent ptlrpcd threads from wandering all over the system, all
ptlrpcd threads are bound to a CPT. Note that some CPT configuration
is always created, but the defaults are not likely to be correct for
a NUMA system. After discussing the options with Liang Zhen we
decided that we would not bind ptlrpcd threads to specific CPUs,
and would rather trust the kernel scheduler to migrate ptlrpcd
threads within their CPT.

With all ptlrpcd threads bound to a CPT, but not to specific CPUs,
the load policy mechanism can be radically simplified:

- PDL_POLICY_LOCAL and PDL_POLICY_ROUND are currently identical.
- PDL_POLICY_ROUND, if fully implemented, would cost us the locality
  we are trying to achieve, so most or all calls using this policy
  would have to be changed to PDL_POLICY_LOCAL.
- PDL_POLICY_PREFERRED is not used, and cannot be implemented without
  binding ptlrpcd threads to individual CPUs.
- PDL_POLICY_SAME is rarely used, and cannot be implemented without
  binding ptlrpcd threads to individual CPUs.

The partner mechanism is also updated, because now all ptlrpcd
threads are "bound" threads. The only difference between the various
bind policies, PDB_POLICY_NONE, PDB_POLICY_FULL, PDB_POLICY_PAIR, and
PDB_POLICY_NEIGHBOR, is the number of partner threads. The bind
policy is replaced with a tunable that directly specifies the size of
the groups of ptlrpcd partner threads.

Ensure that the ptlrpc_request_set for a ptlrpcd thread is created on
the same CPT that the thread will work on. When threads are bound to
specific nodes and/or CPUs in a NUMA system, it pays to ensure that
the data structures used by these threads are also on the same node.

Visible changes:

* ptlrpcd thread names include the CPT number, for example
  "ptlrpcd_02_07". In this case the "07" is relative to the CPT, and
  not a CPU number.
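
The naming scheme can be sketched as follows. This is a minimal
illustration, not code from the patch: the helper name
ptlrpcd_thread_name() and the exact format string are assumptions
inferred from the "ptlrpcd_02_07" example.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: build a thread name from the CPT number and the
 * thread's index within that CPT. The zero-padded two-digit fields are
 * an assumption based on the example name in the commit message.
 */
static int ptlrpcd_thread_name(char *buf, size_t len, int cpt, int index)
{
	return snprintf(buf, len, "ptlrpcd_%02d_%02d", cpt, index);
}
```

Called with cpt 2 and index 7, the helper produces the example name,
where the second field counts threads inside the CPT rather than CPUs.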

Tunables added:

* ptlrpcd_cpts (string): A CPT string describing the CPU partitions
  that ptlrpcd threads should run on. Used to make ptlrpcd threads
  run on a subset of all CPTs.

* ptlrpcd_per_cpt_max (int): The maximum number of ptlrpcd threads
  to run in a CPT.

* ptlrpcd_partner_group_size (int): The desired number of threads
  in each ptlrpcd partner thread group. Default is 2, corresponding
  to the old PDB_POLICY_PAIR. A negative value makes all ptlrpcd
  threads in a CPT partners of each other.
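
How ptlrpcd_partner_group_size might translate into groups can be
sketched as below. The helper names are hypothetical, and the handling
of 0, the clamping to the CPT's thread count, and the rounding are
assumptions made for illustration. Only the default of 2 and the
meaning of a negative value come from the description above.

```c
/*
 * Hypothetical helper, not taken from the patch: resolve the effective
 * partner group size for a CPT that runs nthreads ptlrpcd threads.
 */
static int partner_group_size(int tunable, int nthreads)
{
	if (tunable < 0)
		return nthreads;	/* negative: the whole CPT is one group */
	if (tunable == 0)
		return 2;		/* assumption: unset means the default */
	if (tunable > nthreads)
		return nthreads;	/* assumption: clamp to the CPT */
	return tunable;
}

/* A thread's partners are the other members of its group. */
static int partner_count(int groupsize)
{
	return groupsize - 1;
}

/*
 * Assumption: round the thread count up to a multiple of the group size
 * so that every group on the CPT has the same number of threads.
 */
static int round_up_threads(int nthreads, int groupsize)
{
	return ((nthreads + groupsize - 1) / groupsize) * groupsize;
}
```

With the default of 2, each thread has exactly one partner, which
matches the old PDB_POLICY_PAIR behaviour.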

Tunables obsoleted:

* max_ptlrpcds: The new ptlrpcd_per_cpt_max can be used to obtain the
  same effect.

* ptlrpcd_bind_policy: The new ptlrpcd_partner_group_size can be used
  to obtain the same effect.

Internal interface changes:

* pdb_policy_t and related code have been removed. Groups of partner
  ptlrpcd threads are still created, and all threads in a partner
  group are bound on the same CPT. The ptlrpcd threads bound to a
  CPT are typically divided into several partner groups. The partner
  groups on a CPT all have an equal number of ptlrpcd threads.

* pdl_policy_t and related code have been removed. Since ptlrpcd
  threads are not bound to a specific CPU, all the code that avoids
  scheduling on the current CPU (or attempts to do so) has been
  removed as non-functional. A simplified form of PDL_POLICY_LOCAL
  is kept as the only load policy.

* LIOD_BIND and related code have been removed. All ptlrpcd threads
  are now bound to a CPT, and no additional binding policy is
  implemented.

* ptlrpc_prep_set(): Changed to allocate a ptlrpc_request_set
  on the current CPT.

* ptlrpcd(): If an error is encountered before entering the main loop,
  store the error in pc_error before exiting.

* ptlrpcd_start(): Check pc_error to verify that the ptlrpcd thread
  has successfully entered its main loop.

* ptlrpcd_init(): Initialize the struct ptlrpcd_ctl for all threads
  for a CPT before starting any of them. This closes a race during
  startup where a partner thread could reference a non-initialized
  struct ptlrpcd_ctl.
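
The startup ordering described for ptlrpcd_init() can be illustrated
with a small userspace model. This is a sketch under stated
assumptions, not the kernel code: struct ctl, the fixed sizes, and all
function names are hypothetical, threads are modelled as plain function
calls, and the error field only mirrors the role of pc_error.

```c
#define NTHREADS	4	/* hypothetical: threads on one CPT */
#define GROUPSIZE	2	/* hypothetical: partner group size */
#define ERR_NOT_READY	(-1)	/* hypothetical error code */

struct ctl {
	int initialized;			/* set once phase 1 has run */
	int index;
	int error;				/* mirrors the role of pc_error */
	struct ctl *partners[GROUPSIZE - 1];
};

/* Phase 1: fill in one control structure; nothing runs yet. */
static void ctl_init(struct ctl *all, int i)
{
	int first = i - (i % GROUPSIZE);	/* first member of i's group */
	int n = 0;

	all[i].index = i;
	all[i].error = 0;
	for (int j = first; j < first + GROUPSIZE; j++)
		if (j != i)
			all[i].partners[n++] = &all[j];
	all[i].initialized = 1;
}

/* Phase 2: stand-in for starting a thread that inspects its partners. */
static int ctl_start(struct ctl *pc)
{
	for (int n = 0; n < GROUPSIZE - 1; n++)
		if (!pc->partners[n]->initialized)
			pc->error = ERR_NOT_READY;	/* the startup race */
	return pc->error;	/* the starter checks this value */
}

/* Ordering from the commit message: initialize all, then start. */
static int start_all(struct ctl *all)
{
	for (int i = 0; i < NTHREADS; i++)
		ctl_init(all, i);
	for (int i = 0; i < NTHREADS; i++) {
		int rc = ctl_start(&all[i]);

		if (rc != 0)
			return rc;
	}
	return 0;
}

/* Racy ordering: start each thread right after initializing it. */
static int start_interleaved(struct ctl *all)
{
	for (int i = 0; i < NTHREADS; i++) {
		int rc;

		ctl_init(all, i);
		rc = ctl_start(&all[i]);
		if (rc != 0)
			return rc;
	}
	return 0;
}
```

On a zero-initialized array, start_all() succeeds, while
start_interleaved() fails on the first thread because its partner has
not been initialized yet.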

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: http://review.whamcloud.com/13972
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6325
Reviewed-by: Grégoire Pichon <gregoire.pichon@bull.net>
Reviewed-by: Stephen Champion <schamp@sgi.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
parent 69456a03
+13 −41
@@ -2191,21 +2191,29 @@ struct ptlrpcd_ctl {
 	 */
 	struct lu_env	       pc_env;
 	/**
-	 * Index of ptlrpcd thread in the array.
+	 * CPT the thread is bound on.
 	 */
-	int			 pc_index;
+	int				pc_cpt;
 	/**
-	 * Number of the ptlrpcd's partners.
+	 * Index of ptlrpcd thread in the array.
 	 */
-	int			 pc_npartners;
+	int				pc_index;
 	/**
 	 * Pointer to the array of partners' ptlrpcd_ctl structure.
 	 */
 	struct ptlrpcd_ctl	**pc_partners;
+	/**
+	 * Number of the ptlrpcd's partners.
+	 */
+	int				pc_npartners;
 	/**
 	 * Record the partner index to be processed next.
 	 */
 	int			 pc_cursor;
+	/**
+	 * Error code if the thread failed to fully start.
+	 */
+	int				pc_error;
 };

 /* Bits for pc_flags */
@@ -2228,10 +2236,6 @@ enum ptlrpcd_ctl_flags {
 	 * This is a recovery ptlrpc thread.
 	 */
 	LIOD_RECOVERY    = 1 << 3,
-	/**
-	 * The ptlrpcd is bound to some CPU core.
-	 */
-	LIOD_BIND	= 1 << 4,
 };

 /**
@@ -2903,43 +2907,11 @@ void ptlrpc_pinger_ir_down(void);
 /** @} */
 int ptlrpc_pinger_suppress_pings(void);

-/* ptlrpc daemon bind policy */
-typedef enum {
-	/* all ptlrpcd threads are free mode */
-	PDB_POLICY_NONE	  = 1,
-	/* all ptlrpcd threads are bound mode */
-	PDB_POLICY_FULL	  = 2,
-	/* <free1 bound1> <free2 bound2> ... <freeN boundN> */
-	PDB_POLICY_PAIR	  = 3,
-	/* <free1 bound1> <bound1 free2> ... <freeN boundN> <boundN free1>,
-	 * means each ptlrpcd[X] has two partners: thread[X-1] and thread[X+1].
-	 * If kernel supports NUMA, pthrpcd threads are binded and
-	 * grouped by NUMA node */
-	PDB_POLICY_NEIGHBOR      = 4,
-} pdb_policy_t;
-
-/* ptlrpc daemon load policy
- * It is caller's duty to specify how to push the async RPC into some ptlrpcd
- * queue, but it is not enforced, affected by "ptlrpcd_bind_policy". If it is
- * "PDB_POLICY_FULL", then the RPC will be processed by the selected ptlrpcd,
- * Otherwise, the RPC may be processed by the selected ptlrpcd or its partner,
- * depends on which is scheduled firstly, to accelerate the RPC processing. */
-typedef enum {
-	/* on the same CPU core as the caller */
-	PDL_POLICY_SAME	 = 1,
-	/* within the same CPU partition, but not the same core as the caller */
-	PDL_POLICY_LOCAL	= 2,
-	/* round-robin on all CPU cores, but not the same core as the caller */
-	PDL_POLICY_ROUND	= 3,
-	/* the specified CPU core is preferred, but not enforced */
-	PDL_POLICY_PREFERRED    = 4,
-} pdl_policy_t;
-
 /* ptlrpc/ptlrpcd.c */
 void ptlrpcd_stop(struct ptlrpcd_ctl *pc, int force);
 void ptlrpcd_free(struct ptlrpcd_ctl *pc);
 void ptlrpcd_wake(struct ptlrpc_request *req);
-void ptlrpcd_add_req(struct ptlrpc_request *req, pdl_policy_t policy, int idx);
+void ptlrpcd_add_req(struct ptlrpc_request *req);
 void ptlrpcd_add_rqset(struct ptlrpc_request_set *set);
 int ptlrpcd_addref(void);
 void ptlrpcd_decref(void);
+4 −4
@@ -1212,12 +1212,12 @@ int ldlm_cli_cancel_req(struct obd_export *exp, struct list_head *cancels,

 		ptlrpc_request_set_replen(req);
 		if (flags & LCF_ASYNC) {
-			ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+			ptlrpcd_add_req(req);
 			sent = count;
 			goto out;
-		} else {
-			rc = ptlrpc_queue_wait(req);
 		}
+
+		rc = ptlrpc_queue_wait(req);
 		if (rc == LUSTRE_ESTALE) {
 			CDEBUG(D_DLMTRACE, "client/server (nid %s) out of sync -- not fatal\n",
 			       libcfs_nid2str(req->rq_import->
@@ -2223,7 +2223,7 @@ static int replay_one_lock(struct obd_import *imp, struct ldlm_lock *lock)
 	aa = ptlrpc_req_async_args(req);
 	aa->lock_handle = body->lock_handle[0];
 	req->rq_interpret_reply = (ptlrpc_interpterer_t)replay_lock_interpret;
-	ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+	ptlrpcd_add_req(req);

 	return 0;
 }
+1 −1
@@ -1307,7 +1307,7 @@ int mdc_intent_getattr_async(struct obd_export *exp,
 	ga->ga_einfo = einfo;

 	req->rq_interpret_reply = mdc_intent_getattr_async_interpret;
-	ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+	ptlrpcd_add_req(req);

 	return 0;
 }
+1 −1
@@ -2639,7 +2639,7 @@ static int mdc_renew_capa(struct obd_export *exp, struct obd_capa *oc,
 	ra->ra_oc = oc;
 	ra->ra_cb = cb;
 	req->rq_interpret_reply = mdc_interpret_renew_capa;
-	ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+	ptlrpcd_add_req(req);
 	return 0;
 }

+13 −15
@@ -1934,7 +1934,7 @@ static int get_write_extents(struct osc_object *obj, struct list_head *rpclist)

 static int
 osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol)
+		   struct osc_object *osc)
 {
 	LIST_HEAD(rpclist);
 	struct osc_extent *ext;
@@ -1986,7 +1986,7 @@ osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,

 	if (!list_empty(&rpclist)) {
 		LASSERT(page_count > 0);
-		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_WRITE, pol);
+		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_WRITE);
 		LASSERT(list_empty(&rpclist));
 	}

@@ -2006,7 +2006,7 @@ osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
  */
 static int
 osc_send_read_rpc(const struct lu_env *env, struct client_obd *cli,
-		  struct osc_object *osc, pdl_policy_t pol)
+		  struct osc_object *osc)
 {
 	struct osc_extent *ext;
 	struct osc_extent *next;
@@ -2033,7 +2033,7 @@ osc_send_read_rpc(const struct lu_env *env, struct client_obd *cli,
 		osc_object_unlock(osc);

 		LASSERT(page_count > 0);
-		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_READ, pol);
+		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_READ);
 		LASSERT(list_empty(&rpclist));

 		osc_object_lock(osc);
@@ -2079,8 +2079,7 @@ static struct osc_object *osc_next_obj(struct client_obd *cli)
 }

 /* called with the loi list lock held */
-static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
-			   pdl_policy_t pol)
+static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli)
 {
 	struct osc_object *osc;
 	int rc = 0;
@@ -2109,7 +2108,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 		 * do io on writes while there are cache waiters */
 		osc_object_lock(osc);
 		if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
-			rc = osc_send_write_rpc(env, cli, osc, pol);
+			rc = osc_send_write_rpc(env, cli, osc);
 			if (rc < 0) {
 				CERROR("Write request failed with %d\n", rc);

@@ -2133,7 +2132,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 			}
 		}
 		if (osc_makes_rpc(cli, osc, OBD_BRW_READ)) {
-			rc = osc_send_read_rpc(env, cli, osc, pol);
+			rc = osc_send_read_rpc(env, cli, osc);
 			if (rc < 0)
 				CERROR("Read request failed with %d\n", rc);
 		}
@@ -2149,7 +2148,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 }

 static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
-			  struct osc_object *osc, pdl_policy_t pol, int async)
+			  struct osc_object *osc, int async)
 {
 	int rc = 0;

@@ -2161,7 +2160,7 @@ static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
 		 * potential stack overrun problem. LU-2859 */
 		atomic_inc(&cli->cl_lru_shrinkers);
 		client_obd_list_lock(&cli->cl_loi_list_lock);
-		osc_check_rpcs(env, cli, pol);
+		osc_check_rpcs(env, cli);
 		client_obd_list_unlock(&cli->cl_loi_list_lock);
 		atomic_dec(&cli->cl_lru_shrinkers);
 	} else {
@@ -2175,14 +2174,13 @@ static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
 static int osc_io_unplug_async(const struct lu_env *env,
 			       struct client_obd *cli, struct osc_object *osc)
 {
-	/* XXX: policy is no use actually. */
-	return osc_io_unplug0(env, cli, osc, PDL_POLICY_ROUND, 1);
+	return osc_io_unplug0(env, cli, osc, 1);
 }

 void osc_io_unplug(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol)
+		   struct osc_object *osc)
 {
-	(void)osc_io_unplug0(env, cli, osc, pol, 0);
+	(void)osc_io_unplug0(env, cli, osc, 0);
 }

 int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
@@ -2922,7 +2920,7 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 	}

 	if (unplug)
-		osc_io_unplug(env, osc_cli(obj), obj, PDL_POLICY_ROUND);
+		osc_io_unplug(env, osc_cli(obj), obj);

 	if (hp || discard) {
 		int rc;