VYPR
Unrated severity · NVD Advisory · Published May 27, 2026 · Updated May 27, 2026

CVE-2026-45859

Description

In the Linux kernel, the following vulnerability has been resolved:

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

Ulrich reports a regression with nfqueue:

If an application did not set the 'F_GSO' capability flag and a GSO packet with an unconfirmed nf_conn entry is received, all packets are now dropped instead of queued, because the check happens after skb_gso_segment(). In that case, we did have exclusive ownership of the skb and its associated conntrack entry. The elevated use count is due to skb_clone happening via skb_gso_segment().

Move the check so that it's performed on the aggregated packet.

Then, annotate the individual segments except the first one so we can do a 2nd check at reinject time.

For the normal case, where userspace does in-order reinjects, this avoids packet drops: first reinjected segment continues traversal and confirms entry, remaining segments observe the confirmed entry.

While at it, simplify nf_ct_drop_unconfirmed(): We only care about unconfirmed entries with a refcnt > 1, there is no need to special-case dying entries.

This only happens with UDP. With TCP, the only unconfirmed packet will be the TCP SYN, and those aren't aggregated by GRO.

Next patch adds a udpgro test case to cover this scenario.
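The reinject-time behaviour described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `ct_model`, `entry_model`, `reinject()` and `accepted_count()` are hypothetical names, not kernel code, and the reference count is kept elevated throughout to stand in for the other outstanding segments. Every segment except the first carries the annotation, and an accepted segment is assumed to confirm the entry.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SEGS 8

enum verdict_model { V_DROP = 0, V_ACCEPT = 1 };

/* Hypothetical stand-in for struct nf_conn: only what the check reads. */
struct ct_model {
	int  use;        /* reference count, stays > 1 while segments are queued */
	bool confirmed;  /* true once a reinjected segment confirmed the entry */
};

/* Hypothetical stand-in for struct nf_queue_entry. */
struct entry_model {
	struct ct_model *ct;
	bool ct_is_unconfirmed;  /* annotation: set on all segments but the first */
};

/* Second check at reinject time: an annotated segment is dropped if the
 * entry is still unconfirmed and shared. An accepted segment confirms it.
 */
enum verdict_model reinject(struct entry_model *e, enum verdict_model verdict)
{
	if (verdict != V_DROP && e->ct_is_unconfirmed &&
	    !e->ct->confirmed && e->ct->use > 1)
		verdict = V_DROP;

	if (verdict == V_ACCEPT)
		e->ct->confirmed = true;

	return verdict;
}

/* Queue nsegs segments of one unconfirmed flow, then reinject them with an
 * ACCEPT verdict in the given order. Returns how many were accepted.
 */
int accepted_count(const int *order, int nsegs)
{
	struct ct_model ct = { .use = 1 + nsegs, .confirmed = false };
	struct entry_model e[MAX_SEGS];
	int accepted = 0;

	assert(nsegs > 0 && nsegs <= MAX_SEGS);

	for (int i = 0; i < nsegs; i++) {
		e[i].ct = &ct;
		e[i].ct_is_unconfirmed = (i > 0);
	}

	for (int i = 0; i < nsegs; i++)
		if (reinject(&e[order[i]], V_ACCEPT) == V_ACCEPT)
			accepted++;

	return accepted;
}
```

In this model, reinjecting in queue order accepts every segment, while reinjecting the second segment first loses exactly that one segment, which corresponds to the out-of-order case the patch treats as rare.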

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

Netfilter nfqueue causes packet drops for GSO packets with unconfirmed conntrack entries when F_GSO flag is not set; fixed by moving check before segmentation.

Vulnerability

In the Linux kernel, a regression in the nfnetlink_queue module causes GSO (Generic Segmentation Offload) packets with unconfirmed conntrack entries to be dropped instead of queued. The issue occurs when the userspace application has not set the F_GSO capability flag. The check for shared unconfirmed conntrack entries ran after skb_gso_segment(), so the reference count elevated by skb_clone triggered a false-positive drop even though the kernel held exclusive ownership of the skb and its conntrack entry. Kernels that contain the regressing change are affected until they receive the fix.
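To make the ordering problem concrete, here is a minimal userspace sketch. It is not kernel code: `ct_model`, `queued_check_after()` and `queued_check_before()` are hypothetical names, and segmentation is modelled simply as each segment taking one extra reference on the shared conntrack entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nf_conn: only what the check reads. */
struct ct_model {
	int  use;        /* reference count (ct_general.use in the kernel) */
	bool confirmed;  /* true once the entry has been confirmed */
};

/* Mirrors the simplified check: unconfirmed and shared means "do not queue". */
bool drop_unconfirmed(const struct ct_model *ct)
{
	return !ct->confirmed && ct->use > 1;
}

/* Regressed order: segment first, then check each segment.
 * Returns the number of segments that would be queued.
 */
int queued_check_after(struct ct_model *ct, int nsegs)
{
	int queued = 0;

	assert(nsegs > 0);
	ct->use += nsegs;  /* model: every segment references the same entry */

	for (int i = 0; i < nsegs; i++)
		if (!drop_unconfirmed(ct))
			queued++;

	return queued;
}

/* Fixed order: check the aggregated packet once, then segment. */
int queued_check_before(struct ct_model *ct, int nsegs)
{
	assert(nsegs > 0);

	if (drop_unconfirmed(ct))
		return 0;  /* entry really is shared with someone else */

	ct->use += nsegs;
	return nsegs;
}
```

With an exclusively owned, unconfirmed entry (`use == 1`), the first function queues nothing because the clones themselves raise the count, while the second queues every segment and still rejects an entry that was already shared before segmentation.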

Exploitation

An attacker can trigger the drop by sending UDP traffic that the receiver aggregates via GRO into a GSO packet (only UDP is affected: with TCP the only unconfirmed packet is the SYN, which GRO does not aggregate) to a system running an nfqueue application that did not advertise the F_GSO capability. The aggregated packet must be associated with a conntrack entry that is still unconfirmed. No authentication or special privileges are required from the attacker; the exploit relies only on network access.

Impact

Successful exploitation leads to denial of service: legitimate UDP GSO packets are silently dropped, preventing them from reaching the intended application. This can cause disruption to services relying on nfqueue for packet processing (e.g., network filters, intrusion detection systems).

Mitigation

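The fix is included in the Linux kernel commit [1] (stable tree); the patches listed for this CVE mark it as fixed in releases including 6.12.75, 6.19.4, and 7.0. Users should apply the patch or update to a kernel that contains it. If patching is not possible, a workaround is to have nfqueue applications set the F_GSO capability flag so that the kernel queues the aggregated packet without segmenting it; this requires the application to handle GSO packets and may not be feasible in all cases. The issue is not known to be exploited in the wild.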
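As a rough illustration of the workaround, the sketch below encodes the two netlink attributes that request the GSO capability, `NFQA_CFG_FLAGS` and `NFQA_CFG_MASK`, both carrying `NFQA_CFG_F_GSO` in network byte order. The helper names `put_u32_attr()` and `put_gso_capability()` are hypothetical, and the surrounding `NFQNL_MSG_CONFIG` message header and socket handling are left out. It assumes the Linux UAPI headers are installed. Programs built on libmnl would typically achieve the same with two `mnl_attr_put_u32()` calls.

```c
#include <arpa/inet.h>                       /* htonl, ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>                   /* struct nlattr, NLA_HDRLEN */
#include <linux/netfilter/nfnetlink_queue.h> /* NFQA_CFG_* constants */

/* Hypothetical helper: write one u32 netlink attribute at buf.
 * value_be must already be in network byte order.
 * Returns the number of bytes written, including alignment padding.
 */
size_t put_u32_attr(unsigned char *buf, uint16_t type, uint32_t value_be)
{
	struct nlattr nla = {
		.nla_len  = NLA_HDRLEN + sizeof(value_be),
		.nla_type = type,
	};

	memcpy(buf, &nla, sizeof(nla));
	memcpy(buf + NLA_HDRLEN, &value_be, sizeof(value_be));
	return NLA_ALIGN(nla.nla_len);
}

/* Hypothetical helper: append the flag/mask pair that asks the kernel to
 * hand GSO packets to userspace unsegmented. The mask selects which flag
 * bits to change, the flags attribute carries their new value.
 */
size_t put_gso_capability(unsigned char *buf, size_t buflen)
{
	size_t off = 0;

	assert(buflen >= 2 * (NLA_HDRLEN + sizeof(uint32_t)));

	off += put_u32_attr(buf + off, NFQA_CFG_FLAGS, htonl(NFQA_CFG_F_GSO));
	off += put_u32_attr(buf + off, NFQA_CFG_MASK, htonl(NFQA_CFG_F_GSO));
	return off;
}
```

An application that sets this flag should expect to receive aggregated packets that are larger than the MTU, so this is only an option when the userspace program can process such packets correctly.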

AI Insight generated on May 27, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected products (1)

Patches (8)
b740e7ddd7ca

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git · Florian Westphal · Nov 20, 2025 · Fixed in 6.19.4 · via kernel-cna
2 files changed · +75 -50
  • include/net/netfilter/nf_queue.h · +1 -0 · modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c · +74 -50 · modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index 336e3ad18e72dc..34548213f2f14f 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -438,6 +438,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -465,6 +493,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -894,49 +940,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -953,9 +956,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -998,7 +998,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1077,9 +1076,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1103,6 +1103,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1116,7 +1125,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
    -- 
    cgit 1.3-korg
    
    
    
207b3ebacb61

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git · Florian Westphal · Nov 20, 2025 · Fixed in 7.0 · via kernel-cna
2 files changed · +75 -50
  • include/net/netfilter/nf_queue.h · +1 -0 · modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c · +74 -50 · modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index 671b52c652ef6e..f1c8049861a6b7 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -435,6 +435,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -462,6 +490,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -891,49 +937,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -950,9 +953,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -995,7 +995,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1074,9 +1073,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1100,6 +1100,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1113,7 +1122,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
    -- 
    cgit 1.3-korg
    
    
    
  • net/netfilter/nfnetlink_queue.c+74 50 modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index 671b52c652ef6e..f1c8049861a6b7 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -435,6 +435,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -462,6 +490,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -891,49 +937,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -950,9 +953,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -995,7 +995,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1074,9 +1073,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1100,6 +1100,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1113,7 +1122,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
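The hunk above moves the shared-unconfirmed check ahead of segmentation and then marks every segment except the first. What follows is a toy userspace model of that ordering, not kernel code. The names `toy_ct`, `toy_segment`, `toy_enqueue_old` and `toy_enqueue_new` are hypothetical, and the assumption that segmentation adds one reference per segment simplifies what `skb_gso_segment()` does. The reference dropping performed by the removed helper is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct nf_conn: only the two fields the check reads. */
struct toy_ct {
	bool confirmed;   /* models nf_ct_is_confirmed() */
	unsigned int use; /* models ct_general.use */
};

/* Mirrors the patched predicate: an unconfirmed entry with more than
 * one reference is treated as shared and must not be queued.
 */
static bool toy_drop_unconfirmed(const struct toy_ct *ct, bool *is_unconfirmed)
{
	if (!ct || ct->confirmed)
		return false;
	if (is_unconfirmed)
		*is_unconfirmed = true;
	return ct->use > 1;
}

/* Assumption: each segment created holds one extra reference. */
static void toy_segment(struct toy_ct *ct, unsigned int nsegs)
{
	if (ct)
		ct->use += nsegs;
}

/* Old ordering: segment first, then check per segment.
 * Returns the number of segments that would be queued.
 */
static unsigned int toy_enqueue_old(struct toy_ct *ct, unsigned int nsegs)
{
	unsigned int i, queued = 0;

	toy_segment(ct, nsegs);
	for (i = 0; i < nsegs; i++)
		if (!toy_drop_unconfirmed(ct, NULL))
			queued++;
	return queued;
}

/* New ordering: check the aggregated packet once, then segment and
 * annotate all segments but the first. annotated[] must hold nsegs
 * elements. Every segment is assumed to queue successfully, so the
 * loop index plays the role of the kernel's 'queued' counter.
 */
static unsigned int toy_enqueue_new(struct toy_ct *ct, unsigned int nsegs,
				    bool *annotated)
{
	bool ct_is_unconfirmed = false;
	unsigned int i;

	if (toy_drop_unconfirmed(ct, &ct_is_unconfirmed))
		return 0; /* shared before segmentation: refuse to queue */

	toy_segment(ct, nsegs);
	for (i = 0; i < nsegs; i++)
		annotated[i] = ct_is_unconfirmed && i > 0;
	return nsegs;
}
```

In this model, an exclusively owned unconfirmed entry (`use == 1`) split into three segments queues nothing with the old ordering and all three segments with the new one, with only the second and third segments annotated.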
    
    
    
79b713ef4261

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git · Florian Westphal · Nov 20, 2025 · Fixed in 6.12.75 · via kernel-cna
4 files changed · +150 -100
  • include/net/netfilter/nf_queue.h · +1 -0 · modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c · +74 -50 · modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index fb074e95a767d9..af35dbc19864a0 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -438,6 +438,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -465,6 +493,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -892,49 +938,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -951,9 +954,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -996,7 +996,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1075,9 +1074,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1101,6 +1101,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1114,7 +1123,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
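The reinject-time recheck in the diff above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: `rj_ct`, `rj_entry` and `rj_reinject` are hypothetical names, an accepted packet is assumed to confirm the conntrack entry immediately, and releasing references on drop is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

enum rj_verdict { RJ_DROP = 0, RJ_ACCEPT = 1 };

/* Toy conntrack entry shared by all segments of one aggregated packet. */
struct rj_ct {
	bool confirmed;
	unsigned int use;
};

/* Toy queue entry; ct_is_unconfirmed models entry->nf_ct_is_unconfirmed,
 * which the enqueue path sets on every segment except the first.
 */
struct rj_entry {
	struct rj_ct *ct;
	bool ct_is_unconfirmed;
};

static bool rj_shared_unconfirmed(const struct rj_ct *ct)
{
	return ct && !ct->confirmed && ct->use > 1;
}

/* Returns the verdict actually applied to the segment. */
static enum rj_verdict rj_reinject(struct rj_entry *e, enum rj_verdict verdict)
{
	/* Second check: only annotated segments are re-examined. */
	if (verdict != RJ_DROP && e->ct_is_unconfirmed &&
	    rj_shared_unconfirmed(e->ct))
		verdict = RJ_DROP;

	/* Assumption: an accepted segment finishes hook traversal and
	 * confirms the entry before the next verdict arrives.
	 */
	if (verdict != RJ_DROP && e->ct)
		e->ct->confirmed = true;

	return verdict;
}
```

With in-order verdicts the first segment confirms the entry and the annotated segments then pass the recheck. If an annotated segment is reinjected before the first one, the model drops it, which corresponds to the out-of-order case listed in the patch comment.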
    
    
    
23901aa6b8a2

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git · Florian Westphal · Nov 20, 2025 · Fixed in 6.18.14 · via kernel-cna
4 files changed · +150 -100
  • include/net/netfilter/nf_queue.h · +1 -0 · modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c · +74 -50 · modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index 336e3ad18e72dc..34548213f2f14f 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -438,6 +438,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -465,6 +493,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -894,49 +940,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -953,9 +956,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -998,7 +998,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1077,9 +1076,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1103,6 +1103,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1116,7 +1125,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
    
    
    
207b3ebacb61

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

4 files changed · +150 -100
  • include/net/netfilter/nf_queue.h · +1 -0 · modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c · +74 -50 · modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index 671b52c652ef6e..f1c8049861a6b7 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -435,6 +435,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -462,6 +490,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -891,49 +937,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -950,9 +953,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -995,7 +995,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1074,9 +1073,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1100,6 +1100,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1113,7 +1122,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
    -- 
    cgit 1.3-korg
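
The patch above moves the shared-unconfirmed check ahead of segmentation. A minimal user-space sketch of why the ordering matters follows. It is not kernel code: `struct model_ct`, `model_drop_unconfirmed()` and `model_segment()` are hypothetical stand-ins for `struct nf_conn`, `nf_ct_drop_unconfirmed()` and the refcount effect of `skb_gso_segment()`, and the model assumes each additional segment holds one extra reference to the conntrack entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nf_conn: only what the check reads. */
struct model_ct {
	bool confirmed;   /* models nf_ct_is_confirmed() */
	unsigned int use; /* models ct->ct_general.use */
};

/* Mirrors the patched predicate: true when the entry is unconfirmed
 * and someone else also holds a reference to it.
 */
static bool model_drop_unconfirmed(const struct model_ct *ct, bool *is_unconfirmed)
{
	if (!ct || ct->confirmed)
		return false;

	if (is_unconfirmed)
		*is_unconfirmed = true;

	return ct->use > 1;
}

/* Assumption: splitting into nsegs segments leaves one reference per
 * segment, i.e. nsegs - 1 references on top of the original one.
 */
static void model_segment(struct model_ct *ct, unsigned int nsegs)
{
	assert(nsegs >= 1);
	ct->use += nsegs - 1;
}
```

In this model an exclusively owned, unconfirmed entry passes the check when it is tested against the aggregated packet, but the same entry fails once it has been split into several segments. That is the false positive the commit message describes.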
    
    
    
b740e7ddd7ca

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

4 files changed · +150 100
  • include/net/netfilter/nf_queue.h +1 -0 modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c +74 -50 modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index 336e3ad18e72dc..34548213f2f14f 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -438,6 +438,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -465,6 +493,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -894,49 +940,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -953,9 +956,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -998,7 +998,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1077,9 +1076,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1103,6 +1103,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1116,7 +1125,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
    -- 
    cgit 1.3-korg
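
The second check at reinject time can be sketched the same way. The model below is a simplified user-space illustration, not the kernel implementation: all `model_*` names are hypothetical, an accepted packet is assumed to confirm the entry immediately, and only a dropped packet is assumed to release its reference (references held after acceptance are not tracked).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_verdict { MODEL_DROP = 0, MODEL_ACCEPT = 1 };

struct model_ct {
	bool confirmed;
	unsigned int use;
};

struct model_entry {
	struct model_ct *ct;
	bool nf_ct_is_unconfirmed; /* set on every segment except the first */
};

/* Models the enqueue loop: annotate all segments but the first one. */
static void model_annotate(struct model_entry *segs, size_t n, struct model_ct *ct)
{
	for (size_t i = 0; i < n; i++) {
		segs[i].ct = ct;
		segs[i].nf_ct_is_unconfirmed = !ct->confirmed && i > 0;
	}
	ct->use = (unsigned int)n; /* assumption: one reference per segment */
}

/* Models the added block in nfqnl_reinject(). */
static enum model_verdict model_reinject(struct model_entry *e, enum model_verdict verdict)
{
	struct model_ct *ct = e->ct;

	if (verdict != MODEL_DROP && e->nf_ct_is_unconfirmed &&
	    !ct->confirmed && ct->use > 1)
		verdict = MODEL_DROP;

	if (verdict == MODEL_ACCEPT)
		ct->confirmed = true; /* assumption: traversal confirms the entry */
	else
		ct->use--;            /* dropped packet releases its reference */

	return verdict;
}

/* Reinject n segments (1..8) with ACCEPT in the given order; count survivors. */
static unsigned int model_run(const size_t *order, size_t n)
{
	struct model_ct ct = { .confirmed = false, .use = 1 };
	struct model_entry segs[8];
	unsigned int accepted = 0;

	assert(n >= 1 && n <= 8);
	model_annotate(segs, n, &ct);
	for (size_t i = 0; i < n; i++)
		if (model_reinject(&segs[order[i]], MODEL_ACCEPT) == MODEL_ACCEPT)
			accepted++;

	return accepted;
}
```

With in-order verdicts every segment survives in this model, because the first segment confirms the entry before the annotated ones are checked. With reversed verdicts the annotated segments are dropped while the entry is still shared and unconfirmed, which corresponds to the "out-of-order verdict" case named in the patch comment.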
    
    
    
79b713ef4261

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

4 files changed · +150 100
  • include/net/netfilter/nf_queue.h +1 -0 modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c +74 -50 modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index fb074e95a767d9..af35dbc19864a0 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -438,6 +438,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -465,6 +493,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -892,49 +938,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -951,9 +954,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -996,7 +996,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1075,9 +1074,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1101,6 +1101,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1114,7 +1123,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
    
    
    
23901aa6b8a2

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

4 files changed · +150 100
  • include/net/netfilter/nf_queue.h+1 0 modified
    diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
    index e6803831d6af51..45eb26b2e95b37 100644
    --- a/include/net/netfilter/nf_queue.h
    +++ b/include/net/netfilter/nf_queue.h
    @@ -21,6 +21,7 @@ struct nf_queue_entry {
     	struct net_device	*physout;
     #endif
     	struct nf_hook_state	state;
    +	bool			nf_ct_is_unconfirmed;
     	u16			size; /* sizeof(entry) + saved route keys */
     	u16			queue_num;
     
    
  • net/netfilter/nfnetlink_queue.c+74 50 modified
    diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
    index 336e3ad18e72dc..34548213f2f14f 100644
    --- a/net/netfilter/nfnetlink_queue.c
    +++ b/net/netfilter/nfnetlink_queue.c
    @@ -438,6 +438,34 @@ next_hook:
     	nf_queue_entry_free(entry);
     }
     
    +/* return true if the entry has an unconfirmed conntrack attached that isn't owned by us
    + * exclusively.
    + */
    +static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry, bool *is_unconfirmed)
    +{
    +#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    +	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    +
    +	if (!ct || nf_ct_is_confirmed(ct))
    +		return false;
    +
    +	if (is_unconfirmed)
    +		*is_unconfirmed = true;
    +
    +	/* in some cases skb_clone() can occur after initial conntrack
    +	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    +	 * unconfirmed entries.
    +	 *
    +	 * This happens for br_netfilter and with ip multicast routing.
    +	 * This can't be solved with serialization here because one clone
    +	 * could have been queued for local delivery or could be transmitted
    +	 * in parallel on another CPU.
    +	 */
    +	return refcount_read(&ct->ct_general.use) > 1;
    +#endif
    +	return false;
    +}
    +
     static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     {
     	const struct nf_ct_hook *ct_hook;
    @@ -465,6 +493,24 @@ static void nfqnl_reinject(struct nf_queue_entry *entry, unsigned int verdict)
     			break;
     		}
     	}
    +
    +	if (verdict != NF_DROP && entry->nf_ct_is_unconfirmed) {
    +		/* If first queued segment was already reinjected then
    +		 * there is a good chance the ct entry is now confirmed.
    +		 *
    +		 * Handle the rare cases:
    +		 *  - out-of-order verdict
    +		 *  - threaded userspace reinjecting in parallel
    +		 *  - first segment was dropped
    +		 *
    +		 * In all of those cases we can't handle this packet
    +		 * because we can't be sure that another CPU won't modify
    +		 * nf_conn->ext in parallel which isn't allowed.
    +		 */
    +		if (nf_ct_drop_unconfirmed(entry, NULL))
    +			verdict = NF_DROP;
    +	}
    +
     	nf_reinject(entry, verdict);
     }
     
    @@ -894,49 +940,6 @@ nlmsg_failure:
     	return NULL;
     }
     
    -static bool nf_ct_drop_unconfirmed(const struct nf_queue_entry *entry)
    -{
    -#if IS_ENABLED(CONFIG_NF_CONNTRACK)
    -	static const unsigned long flags = IPS_CONFIRMED | IPS_DYING;
    -	struct nf_conn *ct = (void *)skb_nfct(entry->skb);
    -	unsigned long status;
    -	unsigned int use;
    -
    -	if (!ct)
    -		return false;
    -
    -	status = READ_ONCE(ct->status);
    -	if ((status & flags) == IPS_DYING)
    -		return true;
    -
    -	if (status & IPS_CONFIRMED)
    -		return false;
    -
    -	/* in some cases skb_clone() can occur after initial conntrack
    -	 * pickup, but conntrack assumes exclusive skb->_nfct ownership for
    -	 * unconfirmed entries.
    -	 *
    -	 * This happens for br_netfilter and with ip multicast routing.
    -	 * We can't be solved with serialization here because one clone could
    -	 * have been queued for local delivery.
    -	 */
    -	use = refcount_read(&ct->ct_general.use);
    -	if (likely(use == 1))
    -		return false;
    -
    -	/* Can't decrement further? Exclusive ownership. */
    -	if (!refcount_dec_not_one(&ct->ct_general.use))
    -		return false;
    -
    -	skb_set_nfct(entry->skb, 0);
    -	/* No nf_ct_put(): we already decremented .use and it cannot
    -	 * drop down to 0.
    -	 */
    -	return true;
    -#endif
    -	return false;
    -}
    -
     static int
     __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     			struct nf_queue_entry *entry)
    @@ -953,9 +956,6 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
     	}
     	spin_lock_bh(&queue->lock);
     
    -	if (nf_ct_drop_unconfirmed(entry))
    -		goto err_out_free_nskb;
    -
     	if (queue->queue_total >= queue->queue_maxlen)
     		goto err_out_queue_drop;
     
    @@ -998,7 +998,6 @@ err_out_queue_drop:
     		else
     			net_warn_ratelimited("nf_queue: hash insert failed: %d\n", err);
     	}
    -err_out_free_nskb:
     	kfree_skb(nskb);
     err_out_unlock:
     	spin_unlock_bh(&queue->lock);
    @@ -1077,9 +1076,10 @@ __nfqnl_enqueue_packet_gso(struct net *net, struct nfqnl_instance *queue,
     static int
     nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     {
    -	unsigned int queued;
    -	struct nfqnl_instance *queue;
     	struct sk_buff *skb, *segs, *nskb;
    +	bool ct_is_unconfirmed = false;
    +	struct nfqnl_instance *queue;
    +	unsigned int queued;
     	int err = -ENOBUFS;
     	struct net *net = entry->state.net;
     	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
    @@ -1103,6 +1103,15 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		break;
     	}
     
    +	/* Check if someone already holds another reference to
    +	 * unconfirmed ct.  If so, we cannot queue the skb:
    +	 * concurrent modifications of nf_conn->ext are not
    +	 * allowed and we can't know if another CPU isn't
    +	 * processing the same nf_conn entry in parallel.
    +	 */
    +	if (nf_ct_drop_unconfirmed(entry, &ct_is_unconfirmed))
    +		return -EINVAL;
    +
     	if (!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb)))
     		return __nfqnl_enqueue_packet(net, queue, entry);
     
    @@ -1116,7 +1125,23 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
     		goto out_err;
     	queued = 0;
     	err = 0;
    +
     	skb_list_walk_safe(segs, segs, nskb) {
    +		if (ct_is_unconfirmed && queued > 0) {
    +			/* skb_gso_segment() increments the ct refcount.
    +			 * This is a problem for unconfirmed (not in hash)
    +			 * entries, those can race when reinjections happen
    +			 * in parallel.
    +			 *
    +			 * Annotate this for all queued entries except the
    +			 * first one.
    +			 *
    +			 * As long as the first one is reinjected first it
    +			 * will do the confirmation for us.
    +			 */
    +			entry->nf_ct_is_unconfirmed = ct_is_unconfirmed;
    +		}
    +
     		if (err == 0)
     			err = __nfqnl_enqueue_packet_gso(net, queue,
     							segs, entry);
    
    
    

Vulnerability mechanics

Root cause

"The shared-unconfirmed conntrack check was performed after skb_gso_segment() had already cloned the skb and elevated the conntrack reference count, causing false-positive drops."

Attack vector

An attacker sends UDP traffic that reaches nfqueue as a GSO (Generic Segmentation Offload) packet with an unconfirmed conntrack entry attached. If the nfqueue application has not set the `F_GSO` capability flag, the kernel calls `skb_gso_segment()`, whose internal `skb_clone()` calls raise the conntrack reference count. The original code then ran `nf_ct_drop_unconfirmed()` on the resulting segments, found a refcount > 1, and dropped all of them instead of queueing them, even though the kernel held exclusive ownership of the skb and its conntrack entry [patch_id=2662015]. This only affects UDP: with TCP, the only unconfirmed packet is the SYN, and SYN packets are not aggregated by GRO [patch_id=2662015].
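
The ordering problem can be illustrated with a small userspace model. This is only a sketch: `struct model_ct` and the `model_*` functions are hypothetical stand-ins, not kernel code, and the model assumes that each segment produced by segmentation holds its own reference to the conntrack entry. The pre-fix check is reduced to the same "unconfirmed and refcount > 1" predicate the patch uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct nf_conn (hypothetical, not the kernel type). */
struct model_ct {
	unsigned int use;  /* models ct->ct_general.use */
	bool confirmed;    /* models nf_ct_is_confirmed(ct) */
};

/* Unconfirmed entry that is referenced by more than one owner. */
static bool model_shared_unconfirmed(const struct model_ct *ct)
{
	return !ct->confirmed && ct->use > 1;
}

/* Assumption: every segment produced by segmentation holds its own ct reference. */
static void model_segment(struct model_ct *ct, unsigned int nsegs)
{
	assert(nsegs > 0);
	ct->use += nsegs;
}

/* Pre-fix ordering: segment first, then check per segment. Returns segments queued. */
unsigned int model_check_after_segmentation(struct model_ct ct, unsigned int nsegs)
{
	unsigned int queued = 0;

	model_segment(&ct, nsegs);
	for (unsigned int i = 0; i < nsegs; i++) {
		if (!model_shared_unconfirmed(&ct))
			queued++;
	}
	return queued;
}

/* Post-fix ordering: check the aggregated packet once, then segment. */
unsigned int model_check_before_segmentation(struct model_ct ct, unsigned int nsegs)
{
	if (model_shared_unconfirmed(&ct))
		return 0;
	model_segment(&ct, nsegs);
	return nsegs;
}
```

In this model, an exclusively owned unconfirmed entry (`use == 1`) loses every segment under the pre-fix ordering and keeps every segment under the post-fix ordering, while an entry that was already shared before segmentation is rejected in both.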

Affected code

The vulnerability is in `net/netfilter/nfnetlink_queue.c`, in the `nfqnl_enqueue_packet()` and `__nfqnl_enqueue_packet()` functions. The `nf_ct_drop_unconfirmed()` check ran inside `__nfqnl_enqueue_packet()`, after `skb_gso_segment()` had already cloned the skb and elevated the conntrack reference count. The fix also adds a new `nf_ct_is_unconfirmed` field to `struct nf_queue_entry` in `include/net/netfilter/nf_queue.h` [patch_id=2662015].

What the fix does

The patch moves the `nf_ct_drop_unconfirmed()` check out of `__nfqnl_enqueue_packet()` (which runs per segment after GSO segmentation) and into `nfqnl_enqueue_packet()`, ahead of `skb_gso_segment()`, so the check runs against the original aggregated packet before segmentation has raised the conntrack reference count; a packet that fails it is rejected with `-EINVAL` [patch_id=2662015]. A new `nf_ct_is_unconfirmed` boolean in `struct nf_queue_entry` annotates every segment except the first when the conntrack entry is unconfirmed. At reinject time, `nfqnl_reinject()` runs a second check on annotated entries whose verdict is not `NF_DROP` and drops the packet if the conntrack entry is still unconfirmed with a refcount > 1, which covers out-of-order verdicts, parallel reinjection, and a dropped first segment. With in-order reinjection, the first segment confirms the entry and the remaining segments pass. `nf_ct_drop_unconfirmed()` itself is simplified to return `refcount_read(&ct->ct_general.use) > 1` for unconfirmed entries, with no special case for dying entries [patch_id=2662015].
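
The reinject-time behaviour can be sketched with a second userspace model. All names here (`model_entry`, `model_queue_segments`, `model_reinject`, the `MODEL_*` verdicts) are hypothetical; the model assumes that an accepted packet confirms the entry and that a dropped packet releases its reference, and it ignores everything else the kernel does during traversal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_verdict { MODEL_DROP = 0, MODEL_ACCEPT = 1 };

/* Simplified stand-in for struct nf_conn (hypothetical). */
struct model_ct {
	unsigned int use;
	bool confirmed;
};

/* Simplified stand-in for struct nf_queue_entry (hypothetical). */
struct model_entry {
	struct model_ct *ct;
	bool ct_is_unconfirmed;  /* models entry->nf_ct_is_unconfirmed */
};

/* Queue nsegs segments sharing one ct; annotate all but the first. */
void model_queue_segments(struct model_entry *e, size_t nsegs, struct model_ct *ct)
{
	assert(nsegs > 0);
	for (size_t i = 0; i < nsegs; i++) {
		e[i].ct = ct;
		e[i].ct_is_unconfirmed = !ct->confirmed && i > 0;
		ct->use++;  /* assumption: each segment holds a reference */
	}
	ct->use--;  /* assumption: the aggregated skb releases its reference */
}

/* Second check at reinject time, applied only to annotated entries. */
enum model_verdict model_reinject(struct model_entry *e, enum model_verdict verdict)
{
	struct model_ct *ct = e->ct;

	if (verdict != MODEL_DROP && e->ct_is_unconfirmed &&
	    !ct->confirmed && ct->use > 1)
		verdict = MODEL_DROP;

	if (verdict == MODEL_ACCEPT)
		ct->confirmed = true;  /* accepted packet confirms the entry */
	else
		ct->use--;             /* dropped packet releases its reference */
	return verdict;
}
```

With in-order verdicts the first segment confirms the entry, so the annotated segments that follow are accepted. If an annotated segment is reinjected before the first one, the model drops it, matching the "out-of-order verdict" case named in the patch comment.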

Preconditions

  • config: The nfqueue application must not have set the `F_GSO` capability flag
  • input: The incoming packet must be a UDP GSO packet with an unconfirmed conntrack entry
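
The two preconditions can be combined into a single predicate. The sketch below mirrors the branch condition visible in the patch (`!skb_is_gso(skb) || ((queue->flags & NFQA_CFG_F_GSO) && !skb_is_gso_sctp(skb))`), but the function names and the `MODEL_CFG_F_GSO` bit value are hypothetical and chosen only for this illustration. The model does not encode the protocol; per the description, UDP is the case that occurs in practice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit for this model; the kernel's flag is NFQA_CFG_F_GSO. */
#define MODEL_CFG_F_GSO (1u << 0)

/* True when the kernel would segment the packet before queueing it. */
bool model_needs_segmentation(bool is_gso, bool is_gso_sctp, unsigned int queue_flags)
{
	assert(!is_gso_sctp || is_gso);  /* an SCTP GSO packet is also a GSO packet */

	if (!is_gso)
		return false;
	if ((queue_flags & MODEL_CFG_F_GSO) && !is_gso_sctp)
		return false;
	return true;
}

/* True when both preconditions of the pre-fix packet drop are met. */
bool model_hits_regression(bool is_gso, bool is_gso_sctp,
			   unsigned int queue_flags, bool ct_unconfirmed)
{
	return ct_unconfirmed &&
	       model_needs_segmentation(is_gso, is_gso_sctp, queue_flags);
}
```

In this model, a queue that sets the GSO flag receives the aggregated packet unsegmented, so the pre-fix drop path is never reached for it.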

Generated on May 27, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
