Commit graph

92 commits

Author SHA1 Message Date
Paolo Abeni
941defcea7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
  de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13 23:08:11 +01:00
Shradha Gupta
3e64bb2ae7 net: mana: cleanup mana struct after debugfs_remove()
When on a MANA VM hibernation is triggered, as part of hibernate_snapshot(),
mana_gd_suspend() and mana_gd_resume() are called. If during this
mana_gd_resume(), a failure occurs with HWC creation, mana_port_debugfs
pointer does not get reinitialized and ends up pointing to older,
cleaned-up dentry.
Further in the hibernation path, as part of power_down(), mana_gd_shutdown()
is triggered. This call, unaware of the failures in resume, tries to cleanup
the already cleaned up  mana_port_debugfs value and hits the following bug:

[  191.359296] mana 7870:00:00.0: Shutdown was called
[  191.359918] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  191.360584] #PF: supervisor write access in kernel mode
[  191.361125] #PF: error_code(0x0002) - not-present page
[  191.361727] PGD 1080ea067 P4D 0
[  191.362172] Oops: Oops: 0002 [#1] SMP NOPTI
[  191.362606] CPU: 11 UID: 0 PID: 1674 Comm: bash Not tainted 6.14.0-rc5+ #2
[  191.363292] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/21/2024
[  191.364124] RIP: 0010:down_write+0x19/0x50
[  191.364537] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 de cd ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 16 65 48 8b 05 88 24 4c 6a 48 89 43 08 48 8b 5d
[  191.365867] RSP: 0000:ff45fbe0c1c037b8 EFLAGS: 00010246
[  191.366350] RAX: 0000000000000000 RBX: 0000000000000098 RCX: ffffff8100000000
[  191.366951] RDX: 0000000000000001 RSI: 0000000000000064 RDI: 0000000000000098
[  191.367600] RBP: ff45fbe0c1c037c0 R08: 0000000000000000 R09: 0000000000000001
[  191.368225] R10: ff45fbe0d2b01000 R11: 0000000000000008 R12: 0000000000000000
[  191.368874] R13: 000000000000000b R14: ff43dc27509d67c0 R15: 0000000000000020
[  191.369549] FS:  00007dbc5001e740(0000) GS:ff43dc663f380000(0000) knlGS:0000000000000000
[  191.370213] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  191.370830] CR2: 0000000000000098 CR3: 0000000168e8e002 CR4: 0000000000b73ef0
[  191.371557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  191.372192] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  191.372906] Call Trace:
[  191.373262]  <TASK>
[  191.373621]  ? show_regs+0x64/0x70
[  191.374040]  ? __die+0x24/0x70
[  191.374468]  ? page_fault_oops+0x290/0x5b0
[  191.374875]  ? do_user_addr_fault+0x448/0x800
[  191.375357]  ? exc_page_fault+0x7a/0x160
[  191.375971]  ? asm_exc_page_fault+0x27/0x30
[  191.376416]  ? down_write+0x19/0x50
[  191.376832]  ? down_write+0x12/0x50
[  191.377232]  simple_recursive_removal+0x4a/0x2a0
[  191.377679]  ? __pfx_remove_one+0x10/0x10
[  191.378088]  debugfs_remove+0x44/0x70
[  191.378530]  mana_detach+0x17c/0x4f0
[  191.378950]  ? __flush_work+0x1e2/0x3b0
[  191.379362]  ? __cond_resched+0x1a/0x50
[  191.379787]  mana_remove+0xf2/0x1a0
[  191.380193]  mana_gd_shutdown+0x3b/0x70
[  191.380642]  pci_device_shutdown+0x3a/0x80
[  191.381063]  device_shutdown+0x13e/0x230
[  191.381480]  kernel_power_off+0x35/0x80
[  191.381890]  hibernate+0x3c6/0x470
[  191.382312]  state_store+0xcb/0xd0
[  191.382734]  kobj_attr_store+0x12/0x30
[  191.383211]  sysfs_kf_write+0x3e/0x50
[  191.383640]  kernfs_fop_write_iter+0x140/0x1d0
[  191.384106]  vfs_write+0x271/0x440
[  191.384521]  ksys_write+0x72/0xf0
[  191.384924]  __x64_sys_write+0x19/0x20
[  191.385313]  x64_sys_call+0x2b0/0x20b0
[  191.385736]  do_syscall_64+0x79/0x150
[  191.386146]  ? __mod_memcg_lruvec_state+0xe7/0x240
[  191.386676]  ? __lruvec_stat_mod_folio+0x79/0xb0
[  191.387124]  ? __pfx_lru_add+0x10/0x10
[  191.387515]  ? queued_spin_unlock+0x9/0x10
[  191.387937]  ? do_anonymous_page+0x33c/0xa00
[  191.388374]  ? __handle_mm_fault+0xcf3/0x1210
[  191.388805]  ? __count_memcg_events+0xbe/0x180
[  191.389235]  ? handle_mm_fault+0xae/0x300
[  191.389588]  ? do_user_addr_fault+0x559/0x800
[  191.390027]  ? irqentry_exit_to_user_mode+0x43/0x230
[  191.390525]  ? irqentry_exit+0x1d/0x30
[  191.390879]  ? exc_page_fault+0x86/0x160
[  191.391235]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  191.391745] RIP: 0033:0x7dbc4ff1c574
[  191.392111] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[  191.393412] RSP: 002b:00007ffd95a23ab8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  191.393990] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007dbc4ff1c574
[  191.394594] RDX: 0000000000000005 RSI: 00005a6eeadb0ce0 RDI: 0000000000000001
[  191.395215] RBP: 00007ffd95a23ae0 R08: 00007dbc50003b20 R09: 0000000000000000
[  191.395805] R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000000005
[  191.396404] R13: 00005a6eeadb0ce0 R14: 00007dbc500045c0 R15: 00007dbc50001ee0
[  191.396987]  </TASK>

To fix this, we explicitly set such mana debugfs variables to NULL after
debugfs_remove() is called.

Fixes: 6607c17c6c ("net: mana: Enable debugfs files for MANA device")
Cc: stable@vger.kernel.org
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1741688260-28922-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13 13:26:35 +01:00
Jakub Kicinski
8ef890df40 net: move misc netdev_lock flavors to a separate header
Move the more esoteric helpers for netdev instance lock to
a dedicated header. This avoids growing netdevice.h to infinity
and makes rebuilding the kernel much faster (after touching
the header with the helpers).

The main netdev_lock() / netdev_unlock() functions are used
in static inlines in netdevice.h and will probably be used
most commonly, so keep them in netdevice.h.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-08 09:06:50 -08:00
Erni Sri Satya Vennela
47dfd7a722 net: mana: Add debug logs in MANA network driver
Add more logs to assist in debugging and monitoring
driver behaviour, making it easier to identify potential
issues  during development and testing.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1739842455-23899-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19 18:50:59 -08:00
Shradha Gupta
27315836f4 net: mana: Allow tso_max_size to go up-to GSO_MAX_SIZE
Allow the max aggregated pkt size to go up-to GSO_MAX_SIZE for MANA NIC.
This patch only increases the max allowable gso/gro pkt size for MANA
devices and does not change the defaults.
Following are the perf benefits by increasing the pkt aggregate size from
legacy gso_max_size value(64K) to newer one(up-to 511K

IPv4 tests
for i in {1..10}; do netperf -t TCP_RR  -H 10.0.0.5 -p50000 -- -r80000,80000
-O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done

min	p90	p99	Throughput		gso_max_size
93	171	194	6594.25
97	154	180	7183.74
95	165	189	6927.86
96	165	188	6976.04
93	154	185	7338.05			64K
93	168	189	6938.03
94	169	189	6784.93
92	166	189	7117.56
94	179	191	6678.44
95	157	183	7277.81

min	p90	p99	Throughput
93	134	146	8448.75
95	134	140	8396.54
94	137	148	8204.12
94	137	148	8244.41
94	128	139	8666.52			80K
94	141	153	8116.86
94	138	149	8163.92
92	135	142	8362.72
92	134	142	8497.57
93	136	148	8393.23

IPv6 Tests
for i in {1..10}; do netperf -t TCP_RR  -H fd00:9013:cadd::4 -p50000 --
-r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done

min	p90	p99	Throughput		gso_max_size
108	165	170	6673.2
101	169	189	6451.69
101	165	169	6737.65
102	167	175	6614.64
101	178	189	6247.13			64K
107	163	169	6678.63
106	176	187	6350.86
100	164	169	6617.36
102	163	170	6849.21
102	168	175	6605.7

min	p90	p99	Throughput
108	155	166	7183
110	154	163	7268.87
109	152	159	7434.35
107	145	157	7569.15
107	149	164	7496.17			80K
110	154	159	7245.85
108	156	162	7266.24
109	145	158	7526.66
106	145	151	7785.75
111	148	157	7246.65

Tested on azure env with Accelerated Networking enabled and disabled.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-02-19 09:42:52 +00:00
Linus Torvalds
896d8946da Including fixes from can and netfilter.
Current release - regressions:
 
   - rtnetlink: fix double call of rtnl_link_get_net_ifla()
 
   - tcp: populate XPS related fields of timewait sockets
 
   - ethtool: fix access to uninitialized fields in set RXNFC command
 
   - selinux: use sk_to_full_sk() in selinux_ip_output()
 
 Current release - new code bugs:
 
   - net: make napi_hash_lock irq safe
 
   - eth: bnxt_en: support header page pool in queue API
 
   - eth: ice: fix NULL pointer dereference in switchdev
 
 Previous releases - regressions:
 
   - core: fix icmp host relookup triggering ip_rt_bug
 
   - ipv6:
     - avoid possible NULL deref in modify_prefix_route()
     - release expired exception dst cached in socket
 
   - smc: fix LGR and link use-after-free issue
 
   - hsr: avoid potential out-of-bound access in fill_frame_info()
 
   - can: hi311x: fix potential use-after-free
 
   - eth: ice: fix VLAN pruning in switchdev mode
 
 Previous releases - always broken:
 
   - netfilter:
     - ipset: hold module reference while requesting a module
     - nft_inner: incorrect percpu area handling under softirq
 
   - can: j1939: fix skb reference counting
 
   - eth: mlxsw: use correct key block on Spectrum-4
 
   - eth: mlx5: fix memory leak in mlx5hws_definer_calc_layout
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmdRve4SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk/TUQAItDVkTQDiAUUqd0TDH2SSnboxXozcMF
 fKVrdWulFAOe7qcajUqjkzvTDjAOw9Xbfh8rEiFBLaXmUzSUT2eqbm2VahvWIR5/
 k08v7fTTuCNzEwhnbQlsZ47Nd26LJVPwOvbtE/4V8d50pU/serjWuI/74tUCWAjn
 DQOjyyqRjKgKKY+WWCQ6g4tVpD//DB4Sj15GiE3MhlW1f08AAnPJSe2oTaq0RZBG
 nXo7abOGn8x3RYilrlp/ZwWYuNpVI4oF+qmp+t/46NV+7ER1JgrC97k0kFyyCYVD
 g7vBvFjvA7vDmiuzfsOW2n7IRdcfBtkfi8UJYOIvVgJg7KDF0DXoxj3qD4FagI35
 RWRMJw+PoNlFFkPprlp0we/19jPJWOO6rx+AOMEQD78jrH7NoFQ/+eeBf/nppfjy
 wX0LD1aQsgPk2ju0I8GbcM/qaJ81EJUiYYVLJHieH0+vvmqts8cMDRLZf5t08EHa
 myXcRGB9N8gjguGp5mdR5KmtY82zASlNC0PbDp3nlPYc3e/opCmMRYGjBohO+vqn
 n7u250WThPwiBtOwYmSbcK7zpS034/VX0ufnTT2X3MWnFGGNDv6XVmho/OBuCHqJ
 m/EiJo/D9qVGAIql/vAxN9a4lQKrcZgaMzCyEPYCf7rtLmzx7sfyHWbf/GZlUAU/
 9dUUfWqCbZcL
 =nkPk
 -----END PGP SIGNATURE-----

Merge tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from can and netfilter.

  Current release - regressions:

   - rtnetlink: fix double call of rtnl_link_get_net_ifla()

   - tcp: populate XPS related fields of timewait sockets

   - ethtool: fix access to uninitialized fields in set RXNFC command

   - selinux: use sk_to_full_sk() in selinux_ip_output()

  Current release - new code bugs:

   - net: make napi_hash_lock irq safe

   - eth:
      - bnxt_en: support header page pool in queue API
      - ice: fix NULL pointer dereference in switchdev

  Previous releases - regressions:

   - core: fix icmp host relookup triggering ip_rt_bug

   - ipv6:
      - avoid possible NULL deref in modify_prefix_route()
      - release expired exception dst cached in socket

   - smc: fix LGR and link use-after-free issue

   - hsr: avoid potential out-of-bound access in fill_frame_info()

   - can: hi311x: fix potential use-after-free

   - eth: ice: fix VLAN pruning in switchdev mode

  Previous releases - always broken:

   - netfilter:
      - ipset: hold module reference while requesting a module
      - nft_inner: incorrect percpu area handling under softirq

   - can: j1939: fix skb reference counting

   - eth:
      - mlxsw: use correct key block on Spectrum-4
      - mlx5: fix memory leak in mlx5hws_definer_calc_layout"

* tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
  net: avoid potential UAF in default_operstate()
  vsock/test: verify socket options after setting them
  vsock/test: fix parameter types in SO_VM_SOCKETS_* calls
  vsock/test: fix failures due to wrong SO_RCVLOWAT parameter
  net/mlx5e: Remove workaround to avoid syndrome for internal port
  net/mlx5e: SD, Use correct mdev to build channel param
  net/mlx5: E-Switch, Fix switching to switchdev mode in MPV
  net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled
  net/mlx5: HWS: Properly set bwc queue locks lock classes
  net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout
  bnxt_en: handle tpa_info in queue API implementation
  bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()
  bnxt_en: refactor tpa_info alloc/free into helpers
  geneve: do not assume mac header is set in geneve_xmit_skb()
  mlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4
  ethtool: Fix wrong mod state in case of verbose and no_mask bitset
  ipmr: tune the ipmr_can_free_table() checks.
  netfilter: nft_set_hash: skip duplicated elements pending gc run
  netfilter: ipset: Hold module reference while requesting a module
  ...
2024-12-05 10:25:06 -08:00
Shradha Gupta
31f1b55d5d net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
The current requested response version(V1) for MANA_QUERY_GF_STAT query
results in STATISTICS_FLAGS_TX_ERRORS_GDMA_ERROR value being set to
0 always.
In order to get the correct value for this counter we request the response
version to be V2.

Cc: stable@vger.kernel.org
Fixes: e1df5202e8 ("net :mana :Add remaining GDMA stats for MANA to ethtool")
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1733291300-12593-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-05 12:02:15 +01:00
Peter Zijlstra
cdd30ebb1b module: Convert symbol namespace to string literal
Clean up the existing export namespace code along the same lines of
commit 33def8498f ("treewide: Convert macro and uses of __section(foo)
to __section("foo")") and for the same reason, it is not desired for the
namespace argument to be a macro expansion itself.

Scripted using

  git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file;
  do
    awk -i inplace '
      /^#define EXPORT_SYMBOL_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /^#define MODULE_IMPORT_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /MODULE_IMPORT_NS/ {
        $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
      }
      /EXPORT_SYMBOL_NS/ {
        if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
  	if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
  	    $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
  	    $0 !~ /^my/) {
  	  getline line;
  	  gsub(/[[:space:]]*\\$/, "");
  	  gsub(/[[:space:]]/, "", line);
  	  $0 = $0 " " line;
  	}

  	$0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/,
  		    "\\1(\\2, \"\\3\")", "g");
        }
      }
      { print }' $file;
  done

Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02 11:34:44 -08:00
Shradha Gupta
6607c17c6c net: mana: Enable debugfs files for MANA device
Implement debugfs in MANA driver to be able to view RX,TX,EQ queue
specific attributes and dump their gdma queues.
These dumps can be used by other userspace utilities to improve
debuggability and troubleshooting

Following files are added in debugfs:

/sys/kernel/debug/mana/
|-------------- 1
    |--------------- EQs
    |                 |------- eq0
    |                 |          |---head
    |                 |          |---tail
    |                 |          |---eq_dump
    |                 |------- eq1
    |                 .
    |                 .
    |
    |--------------- adapter-MTU
    |--------------- vport0
                      |------- RX-0
                      |          |---cq_budget
                      |          |---cq_dump
                      |          |---cq_head
                      |          |---cq_tail
                      |          |---rq_head
                      |          |---rq_nbuf
                      |          |---rq_tail
                      |          |---rxq_dump
                      |------- RX-1
                      .
                      .
                      |------- TX-0
                      |          |---cq_budget
                      |          |---cq_dump
                      |          |---cq_head
                      |          |---cq_tail
                      |          |---sq_head
                      |          |---sq_pend_skb_qlen
                      |          |---sq_tail
                      |          |---txq_dump
                      |------- TX-1
                      .
                      .

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-10-09 13:42:04 +01:00
Jakub Kicinski
502cc061de Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/phy/phy_device.c
  2560db6ede ("net: phy: Fix missing of_node_put() for leds")
  1dce520abd ("net: phy: Use for_each_available_child_of_node_scoped()")
https://lore.kernel.org/20240904115823.74333648@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/xilinx/xilinx_axienet.h
drivers/net/ethernet/xilinx/xilinx_axienet_main.c
  858430db28 ("net: xilinx: axienet: Fix race in axienet_stop")
  76abb5d675 ("net: xilinx: axienet: Add statistics support")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05 20:37:20 -07:00
Shradha Gupta
1705341485 net: mana: Improve mana_set_channels() in low mem conditions
The mana_set_channels() function requires detaching the mana
driver and reattaching it with changed channel values.
During this operation if the system is low on memory, the reattach
might fail, causing the network device being down.
To avoid this we pre-allocate buffers at the beginning of set operation,
to prevent complete network loss

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1725248734-21760-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04 16:28:51 -07:00
Souradeep Chakrabarti
b6ecc66203 net: mana: Fix error handling in mana_create_txq/rxq's NAPI cleanup
Currently napi_disable() gets called during rxq and txq cleanup,
even before napi is enabled and hrtimer is initialized. It causes
kernel panic.

? page_fault_oops+0x136/0x2b0
  ? page_counter_cancel+0x2e/0x80
  ? do_user_addr_fault+0x2f2/0x640
  ? refill_obj_stock+0xc4/0x110
  ? exc_page_fault+0x71/0x160
  ? asm_exc_page_fault+0x27/0x30
  ? __mmdrop+0x10/0x180
  ? __mmdrop+0xec/0x180
  ? hrtimer_active+0xd/0x50
  hrtimer_try_to_cancel+0x2c/0xf0
  hrtimer_cancel+0x15/0x30
  napi_disable+0x65/0x90
  mana_destroy_rxq+0x4c/0x2f0
  mana_create_rxq.isra.0+0x56c/0x6d0
  ? mana_uncfg_vport+0x50/0x50
  mana_alloc_queues+0x21b/0x320
  ? skb_dequeue+0x5f/0x80

Cc: stable@vger.kernel.org
Fixes: e1b5683ff6 ("net: mana: Move NAPI from EQ to CQ")
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-04 11:50:04 +01:00
Shradha Gupta
3410d0e14f net: mana: Implement get_ringparam/set_ringparam for mana
Currently the values of WQs for RX and TX queues for MANA devices
are hardcoded to default sizes.
Allow configuring these values for MANA devices as ringparam
configuration(get/set) through ethtool_ops.
Pre-allocate buffers at the beginning of this operation, to
prevent complete network loss in low-memory conditions.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Link: https://patch.msgid.link/1724688461-12203-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-27 16:08:44 -07:00
Long Li
58a63729c9 net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings
After napi_complete_done() is called when NAPI is polling in the current
process context, another NAPI may be scheduled and start running in
softirq on another CPU and may ring the doorbell before the current CPU
does. When combined with unnecessary rings when there is no need to arm
the CQ, it triggers error paths in the hardware.

This patch fixes this by calling napi_complete_done() after doorbell
rings. It limits the number of unnecessary rings when there is
no need to arm. MANA hardware specifies that there must be one doorbell
ring every 8 CQ wraparounds. This driver guarantees one doorbell ring as
soon as the number of consumed CQEs exceeds 4 CQ wraparounds. In practical
workloads, the 4 CQ wraparounds proves to be big enough that it rarely
exceeds this limit before all the napi weight is consumed.

To implement this, add a per-CQ counter cq->work_done_since_doorbell,
and make sure the CQ is armed as soon as passing 4 wraparounds of the CQ.

Cc: stable@vger.kernel.org
Fixes: e1b5683ff6 ("net: mana: Move NAPI from EQ to CQ")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1723219138-29887-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-13 13:09:54 +02:00
Haiyang Zhang
32316f676b net: mana: Fix RX buf alloc_size alignment and atomic op panic
The MANA driver's RX buffer alloc_size is passed into napi_build_skb() to
create SKB. skb_shinfo(skb) is located at the end of skb, and its alignment
is affected by the alloc_size passed into napi_build_skb(). The size needs
to be aligned properly for better performance and atomic operations.
Otherwise, on ARM64 CPU, for certain MTU settings like 4000, atomic
operations may panic on the skb_shinfo(skb)->dataref due to alignment fault.

To fix this bug, add proper alignment to the alloc_size calculation.

Sample panic info:
[  253.298819] Unable to handle kernel paging request at virtual address ffff000129ba5cce
[  253.300900] Mem abort info:
[  253.301760]   ESR = 0x0000000096000021
[  253.302825]   EC = 0x25: DABT (current EL), IL = 32 bits
[  253.304268]   SET = 0, FnV = 0
[  253.305172]   EA = 0, S1PTW = 0
[  253.306103]   FSC = 0x21: alignment fault
Call trace:
 __skb_clone+0xfc/0x198
 skb_clone+0x78/0xe0
 raw6_local_deliver+0xfc/0x228
 ip6_protocol_deliver_rcu+0x80/0x500
 ip6_input_finish+0x48/0x80
 ip6_input+0x48/0xc0
 ip6_sublist_rcv_finish+0x50/0x78
 ip6_sublist_rcv+0x1cc/0x2b8
 ipv6_list_rcv+0x100/0x150
 __netif_receive_skb_list_core+0x180/0x220
 netif_receive_skb_list_internal+0x198/0x2a8
 __napi_poll+0x138/0x250
 net_rx_action+0x148/0x330
 handle_softirqs+0x12c/0x3a0

Cc: stable@vger.kernel.org
Fixes: 80f6215b45 ("net: mana: Add support for jumbo frame")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-12 13:26:35 +01:00
Linus Torvalds
3d51520954 RDMA v6.11 merge window
Usual collection of small improvements and fixes:
 
 - Bug fixes and minor improvments in efa, irdma, mlx4, mlx5, rxe, hf1,
   qib, ocrdma
 
 - bnxt_re support for MSN, which is a new retransmit logic
 
 - Initial mana support for RC qps
 
 - Use after free bug and cleanups in iwcm
 
 - Reduce resource usage in mlx5 when RDMA verbs features are not used
 
 - New verb to drain shared recieve queues, similar to normal recieve
   queues. This is necessary to allow ULPs a clean shutdown. Used in the
   iscsi rdma target
 
 - mlx5 support for more than 16 bits of doorbell indexes
 
 - Doorbell moderation support for bnxt_re
 
 - IB multi-plane support for mlx5
 
 - New EFA adaptor PCI IDs
 
 - RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't rename
   the device
 
 - A collection of hns bugs
 
 - Fix long standing bug in bnxt_re with incorrect endian handling of
   immediate data
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZpfvKQAKCRCFwuHvBreF
 YXomAP46gZpGv5mlMOAXePRuKq6glNZWl3pVuwuycnlmjQcEUQD/dhQbJz0rZKBr
 swuibPo83bFacfXJL7Wxd48m4G3EfgI=
 =1eXu
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual collection of small improvements and fixes:

   - Bug fixes and minor improvments in efa, irdma, mlx4, mlx5, rxe,
     hf1, qib, ocrdma

   - bnxt_re support for MSN, which is a new retransmit logic

   - Initial mana support for RC qps

   - Use after free bug and cleanups in iwcm

   - Reduce resource usage in mlx5 when RDMA verbs features are not used

   - New verb to drain shared recieve queues, similar to normal recieve
     queues. This is necessary to allow ULPs a clean shutdown. Used in
     the iscsi rdma target

   - mlx5 support for more than 16 bits of doorbell indexes

   - Doorbell moderation support for bnxt_re

   - IB multi-plane support for mlx5

   - New EFA adaptor PCI IDs

   - RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't
     rename the device

   - A collection of hns bugs

   - Fix long standing bug in bnxt_re with incorrect endian handling of
     immediate data"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
  IB/hfi1: Constify struct flag_table
  RDMA/mana_ib: Set correct device into ib
  bnxt_re: Fix imm_data endianness
  RDMA: Fix netdev tracker in ib_device_set_netdev
  RDMA/hns: Fix mbx timing out before CMD execution is completed
  RDMA/hns: Fix insufficient extend DB for VFs.
  RDMA/hns: Fix undifined behavior caused by invalid max_sge
  RDMA/hns: Fix shift-out-bounds when max_inline_data is 0
  RDMA/hns: Fix missing pagesize and alignment check in FRMR
  RDMA/hns: Fix unmatch exception handling when init eq table fails
  RDMA/hns: Fix soft lockup under heavy CEQE load
  RDMA/hns: Check atomic wr length
  RDMA/ocrdma: Don't inline statistics functions
  RDMA/core: Introduce "name_assign_type" for an IB device
  RDMA/qib: Fix truncation compilation warnings in qib_verbs.c
  RDMA/qib: Fix truncation compilation warnings in qib_init.c
  RDMA/efa: Add EFA 0xefa3 PCI ID
  RDMA/mlx5: Support per-plane port IB counters by querying PPCNT register
  net/mlx5: mlx5_ifc update for accessing ppcnt register of plane ports
  RDMA/mlx5: Add plane index support when querying PTYS registers
  ...
2024-07-19 09:51:33 -07:00
Konstantin Taranov
1df03a4b44 RDMA/mana_ib: Set correct device into ib
Add mana_get_primary_netdev_rcu helper to get a primary
netdevice for a given port. When mana is used with
netvsc, the VF netdev is controlled by an upper netvsc
device. In a baremetal case, the VF netdev is the
primary device.

Use the mana_get_primary_netdev_rcu() helper in the mana_ib
to get the correct device for querying network states.

Fixes: 8b184e4f1c ("RDMA/mana_ib: Enable RoCE on port 1")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1720705077-322-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-07-14 10:49:53 +03:00
Jakub Kicinski
193b9b2002 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:
  e3f02f32a0 ("ionic: fix kernel panic due to multi-buffer handling")
  d9c0420999 ("ionic: Mark error paths in the data path as unlikely")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-27 12:14:11 -07:00
Ma Ke
1864b82241 net: mana: Fix possible double free in error handling path
When auxiliary_device_add() returns error and then calls
auxiliary_device_uninit(), callback function adev_release
calls kfree(madev). We shouldn't call kfree(madev) again
in the error handling path. Set 'madev' to NULL.

Fixes: a69839d432 ("net: mana: Add support for auxiliary device")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Link: https://patch.msgid.link/20240625130314.2661257-1-make24@iscas.ac.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-27 12:35:58 +02:00
Haiyang Zhang
382d1741b5 net: mana: Add support for page sizes other than 4KB on ARM64
As defined by the MANA Hardware spec, the queue size for DMA is 4KB
minimal, and power of 2. And, the HWC queue size has to be exactly
4KB.

To support page sizes other than 4KB on ARM64, define the minimal
queue size as a macro separately from the PAGE_SIZE, which we always
assumed it to be 4KB before supporting ARM64.

Also, add MANA specific macros and update code related to size
alignment, DMA region calculations, etc.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1718655446-6576-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-18 18:21:18 -07:00
Shradha Gupta
e275e19c91 net: mana: Use mana_cleanup_port_context() for rxq cleanup
To cleanup rxqs in port context structures, instead of duplicating the
code, use existing function mana_cleanup_port_context() which does
the exact cleanup that's needed.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Heng Qi <hengqi@linux.alibaba.com>
Link: https://lore.kernel.org/r/1718349548-28697-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17 18:01:01 -07:00
Shradha Gupta
7fc45cb686 net: mana: Allow variable size indirection table
Allow variable size indirection table allocation in MANA instead
of using a constant value MANA_INDIRECT_TABLE_SIZE.
The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
indirection table is allocated dynamically.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Link: https://lore.kernel.org/r/1718015319-9609-1-git-send-email-shradhagupta@linux.microsoft.com
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-12 21:08:42 +03:00
Eric Dumazet
1eb2cded45 net: annotate writes on dev->mtu from ndo_change_mtu()
Simon reported that ndo_change_mtu() methods were never
updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted
in commit 501a90c945 ("inet: protect against too small
mtu values.")

We read dev->mtu without holding RTNL in many places,
with READ_ONCE() annotations.

It is time to take care of ndo_change_mtu() methods
to use corresponding WRITE_ONCE()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07 16:19:14 -07:00
Jakub Kicinski
0e36c21d76 Merge branch mana-ib-flex of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
Erick Archer says:

====================
mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part)

The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].

At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).

Also, avoid the open-coded arithmetic in the memory allocator functions
[2] using the "struct_size" macro.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

Specifically, the first commit adds the flex member and the patches 2 and
3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".

This code was detected with the help of Coccinelle, and audited and
modified manually. The Coccinelle script used to detect this code pattern
is the following:

virtual report

@rule1@
type t1;
type t2;
identifier i0;
identifier i1;
identifier i2;
identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
position p1;
@@

i0 = sizeof(t1) + sizeof(t2) * i1;
...
i2 = ALLOC@p1(..., i0, ...);

@script:python depends on report@
p1 << rule1.p1;
@@

msg = "WARNING: verify allocation on line %s" % (p1[0].line)
coccilib.report.print_report(p1[0],msg)

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]

v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/
v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/
====================

Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 07:38:28 -07:00
Erick Archer
a68292eb43 net: mana: Avoid open coded arithmetic
This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].

As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2"
and this structure ends in a flexible array:

struct mana_cfg_rx_steer_req_v2 {
        [...]
        mana_handle_t indir_tab[] __counted_by(num_indir_entries);
};

the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the calculation "size + size * count" in
the kzalloc() function.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

This way, the code is more readable and safer.

This code was detected with the help of Coccinelle, and audited and
modified manually.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/160 [2]
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Link: https://lore.kernel.org/r/AS8PR02MB7237A21355C86EC0DCC0D83B8B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 13:46:47 +03:00
Haiyang Zhang
c0de6ab920 net: mana: Fix Rx DMA datasize and skb_over_panic
mana_get_rxbuf_cfg() aligns the RX buffer's DMA datasize to be
multiple of 64. So a packet slightly bigger than mtu+14, say 1536,
can be received and cause skb_over_panic.

Sample dmesg:
[ 5325.237162] skbuff: skb_over_panic: text:ffffffffc043277a len:1536 put:1536 head:ff1100018b517000 data:ff1100018b517100 tail:0x700 end:0x6ea dev:<NULL>
[ 5325.243689] ------------[ cut here ]------------
[ 5325.245748] kernel BUG at net/core/skbuff.c:192!
[ 5325.247838] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 5325.258374] RIP: 0010:skb_panic+0x4f/0x60
[ 5325.302941] Call Trace:
[ 5325.304389]  <IRQ>
[ 5325.315794]  ? skb_panic+0x4f/0x60
[ 5325.317457]  ? asm_exc_invalid_op+0x1f/0x30
[ 5325.319490]  ? skb_panic+0x4f/0x60
[ 5325.321161]  skb_put+0x4e/0x50
[ 5325.322670]  mana_poll+0x6fa/0xb50 [mana]
[ 5325.324578]  __napi_poll+0x33/0x1e0
[ 5325.326328]  net_rx_action+0x12e/0x280

As discussed internally, this alignment is not necessary. To fix
this bug, remove it from the code. So oversized packets will be
marked as CQE_RX_TRUNCATED by NIC, and dropped.

Cc: stable@vger.kernel.org
Fixes: 2fbbd712ba ("net: mana: Enable RX path to handle various MTU sizes")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/1712087316-20886-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:32:03 -07:00
Konstantin Taranov
02fed6d92b net: mana: add msix index sharing between EQs
This patch allows to assign and poll more than one EQ on the same
msix index.
It is achieved by introducing a list of attached EQs in each IRQ context.
It also removes the existing msix_index map that tried to ensure that there
is only one EQ at each msix_index.
This patch exports symbols for creating EQs from other MANA kernel modules.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-15 10:24:53 +00:00
Colin Ian King
f422544118 net: mana: Fix spelling mistake "enforecement" -> "enforcement"
There is a spelling mistake in struct field hc_tx_err_sqpdid_enforecement.
Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231128095304.515492-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-29 20:13:40 -08:00
Jakub Kicinski
7cc9e6d77f eth: link netdev to page_pools in drivers
Link page pool instances to netdev for the drivers which
already link to NAPI. Unless the driver is doing something
very weird per-NAPI should imply per-netdev.

Add netsec as well, Ilias indicates that it fits the mold.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28 15:48:39 +01:00
Shradha Gupta
e1df5202e8 net :mana :Add remaining GDMA stats for MANA to ethtool
Extend performance counter stats in 'ethtool -S <interface>'
for MANA VF to include all GDMA stat counter.

Tested-on: Ubuntu22
Testcases:
1. LISA testcase:
PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
2. LISA testcase:
PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Link: https://lore.kernel.org/r/1700830950-803-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-27 12:54:33 +01:00
Konstantin Taranov
f5247a6ed5 net: mana: Use xdp_set_features_flag instead of direct assignment
This patch uses a helper function for assignment of xdp_features.
This change simplifies backports.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://lore.kernel.org/r/1698430011-21562-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 15:39:29 -07:00
Haiyang Zhang
a43e8e9ffa net: mana: Fix oversized sge0 for GSO packets
Handle the case when GSO SKB linear length is too large.

MANA NIC requires GSO packets to put only the header part to SGE0,
otherwise the TX queue may stop at the HW level.

So, use 2 SGEs for the skb linear part which contains more than the
packet header.

Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-05 11:45:06 +02:00
Haiyang Zhang
7a54de9265 net: mana: Fix the tso_bytes calculation
sizeof(struct hop_jumbo_hdr) is not part of tso_bytes, so remove
the subtraction from header size.

Cc: stable@vger.kernel.org
Fixes: bd7fc6e195 ("net: mana: Add new MANA VF performance counters for easier troubleshooting")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-05 11:45:06 +02:00
Haiyang Zhang
b2b000069a net: mana: Fix TX CQE error handling
For an unknown TX CQE error type (probably from a newer hardware),
still free the SKB, update the queue tail, etc., otherwise the
accounting will be wrong.

Also, TX errors can be triggered by injecting corrupted packets, so
replace the WARN_ONCE to ratelimited error logging.

Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-05 11:45:06 +02:00
Shradha Gupta
ac3899c622 net: mana: Add gdma stats to ethtool output for mana
Extended performance counter stats in 'ethtool -S <interface>'
for MANA VF to include GDMA tx LSO packets and bytes count.

Tested-on: Ubuntu22
Testcases:
1. LISA testcase:
PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
2. LISA testcase:
PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV
3. Validated the GDMA stat packets and byte counters
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-11 09:53:59 +01:00
Jakub Kicinski
4d016ae42e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/igc/igc_main.c
  06b412589e ("igc: Add lock to safeguard global Qbv variables")
  d3750076d4 ("igc: Add TransmissionOverrun counter")

drivers/net/ethernet/microsoft/mana/mana_en.c
  a7dfeda6fd ("net: mana: Fix MANA VF unload when hardware is unresponsive")
  a9ca9f9cef ("page_pool: split types and declarations from page_pool.h")
  92272ec410 ("eth: add missing xdp.h includes in drivers")

net/mptcp/protocol.h
  511b90e392 ("mptcp: fix disconnect vs accept race")
  b8dc6d6ce9 ("mptcp: fix rcv buffer auto-tuning")

tools/testing/selftests/net/mptcp/mptcp_join.sh
  c8c101ae39 ("selftests: mptcp: join: fix 'implicit EP' test")
  03668c65d1 ("selftests: mptcp: join: rework detailed report")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 14:10:53 -07:00
Souradeep Chakrabarti
a7dfeda6fd net: mana: Fix MANA VF unload when hardware is unresponsive
When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
hardware to complete any inflight packets and set the pending send count
to zero. But if the hardware has failed, mana_dealloc_queues()
could wait forever.

Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
which is a somewhat arbitrary value that is more than long enough for
functional hardware to complete any sends.

Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Link: https://lore.kernel.org/r/1691576525-24271-1-git-send-email-schakrabarti@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10 10:27:58 -07:00
Yunsheng Lin
a9ca9f9cef page_pool: split types and declarations from page_pool.h
Split types and pure function declarations from page_pool.h
and add them in page_page/types.h, so that C sources can
include page_pool.h and headers should generally only include
page_pool/types.h as suggested by jakub.
Rename page_pool.h to page_pool/helpers.h to have both in
one place.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com
[Jakub: change microsoft/mana, fix kdoc paths in Documentation]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07 13:05:19 -07:00
Haiyang Zhang
b1d13f7a3b net: mana: Add page pool for RX buffers
Add page pool for RX buffers for faster buffer cycle and reduce CPU
usage.

The standard page pool API is used.

With iperf and 128 threads test, this patch improved the throughput
by 12-15%, and decreased the IRQ associated CPU's usage from 99-100% to
10-50%.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-06 08:36:06 +01:00
Jakub Kicinski
92272ec410 eth: add missing xdp.h includes in drivers
Handful of drivers currently expect to get xdp.h by virtue
of including netdevice.h. This will soon no longer be the case
so add explicit includes.

Reviewed-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-2-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
Long Li
da4e864807 net: mana: Batch ringing RX queue doorbell on receiving packets
It's inefficient to ring the doorbell page every time a WQE is posted to
the received queue. Excessive MMIO writes result in CPU spending more
time waiting on LOCK instructions (atomic operations), resulting in
poor scaling performance.

Move the code for ringing doorbell page to where after we have posted all
WQEs to the receive queue during a callback from napi_poll().

With this change, tests showed an improvement from 120G/s to 160G/s on a
200G physical link, with 16 or 32 hardware queues.

Tests showed no regression in network latency benchmarks on single
connection.

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1689622539-5334-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18 18:00:13 -07:00
Linus Torvalds
7ede5f78a0 v6.5 merge window RDMA pull request
This cycle saw a focus on rxe and bnxt_re drivers:
 
 - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
 
 - rxe uses workqueues instead of tasklets
 
 - rxe has better compliance around access checks for MRs and rereg_mr
 
 - mana supportst he 'v2' FW interface for RX coalescing
 
 - hfi1 bug fix for stale cache entries in its MR cache
 
 - mlx5 buf fix to handle FW failures when destroying QPs
 
 - erdma HW has a new doorbell allocation mechanism for uverbs that is
   secure
 
 - Lots of small cleanups and rework in bnxt_re
    * Use the common mmap functions
    * Support disassociation
    * Improve FW command flow
 
 - bnxt_re support for "low latency push", this allows a packet
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZJxq+wAKCRCFwuHvBreF
 YSN+AQDKAmWdKCqmc3boMIq5wt4h7yYzdW47LpzGarOn5Hf+UgEA6mpPJyRqB43C
 CNXYIbASl/LLaWzFvxCq/AYp6tzuog4=
 =I8ju
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This cycle saw a focus on rxe and bnxt_re drivers:

   - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma

   - rxe uses workqueues instead of tasklets

   - rxe has better compliance around access checks for MRs and rereg_mr

   - mana supportst he 'v2' FW interface for RX coalescing

   - hfi1 bug fix for stale cache entries in its MR cache

   - mlx5 buf fix to handle FW failures when destroying QPs

   - erdma HW has a new doorbell allocation mechanism for uverbs that is
     secure

   - Lots of small cleanups and rework in bnxt_re:
       - Use the common mmap functions
       - Support disassociation
       - Improve FW command flow
       - support for 'low latency push'"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
  RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
  RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
  RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
  RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
  RDMA/bnxt_re: Remove incorrect return check from slow path
  RDMA/bnxt_re: Enable low latency push
  RDMA/bnxt_re: Reorg the bar mapping
  RDMA/bnxt_re: Move the interface version to chip context structure
  RDMA/bnxt_re: Query function capabilities from firmware
  RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
  RDMA/bnxt_re: Add disassociate ucontext support
  RDMA/bnxt_re: Use the common mmap helper functions
  RDMA/bnxt_re: Initialize opcode while sending message
  RDMA/cma: Remove NULL check before dev_{put, hold}
  RDMA/rxe: Simplify cq->notify code
  RDMA/rxe: Fixes mr access supported list
  RDMA/bnxt_re: optimize the parameters passed to helper functions
  RDMA/bnxt_re: remove redundant cmdq_bitmap
  RDMA/bnxt_re: use firmware provided max request timeout
  RDMA/bnxt_re: cancel all control path command waiters upon error
  ...
2023-06-29 21:01:17 -07:00
Jason Gunthorpe
5f004bcaee Linux 6.4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmSYzfYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/ucH/iOM/1Py/fSg0qSs
 7NJ4XXlourT5zrnRMom3cm3d9gYqgTzgvKFL3kjMEexTRVYbhlcO4ZPRsiry8zxF
 ToGX+V8tDMqb8WSdFHzkljRY+zDRyfEUDMlTzROAD9DunLmQtkJKyrggkeGdjkpP
 OyfGqKpwlLXZRAXBil/U8Mx9MHdjJubloZwghLZr33VdUZa68+JJ9l6w163Oe/ET
 K264NM0wxN/kvN57JvePgqMccQwpINylg8IhRI+XelgczjUXeJBsOA8TDv4bDN4Q
 bjCLhkWbIaZtTYqvOXa/kD0T8wd7KETsMBQN8YzyDh6W0GmAlJjTawyAhA6jA5in
 x3uz2W8=
 =L3zp
 -----END PGP SIGNATURE-----

Merge tag 'v6.4' into rdma.git for-next

Linux 6.4

Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:

drivers/infiniband/sw/rxe/rxe_cq.c:
  https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-27 14:06:29 -03:00
Haiyang Zhang
b803d1fded net: mana: Add support for vlan tagging
To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
skb. Then extract the vlan tag from the skb struct, and save it to
tx_oob for the NIC to transmit. For vlan tags on the payload, they
are accepted by the NIC too.

For RX, extract the vlan tag from CQE and put it into skb.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:38:31 +01:00
Long Li
2145328515 RDMA/mana_ib: Use v2 version of cfg_rx_steer_req to enable RX coalescing
With RX coalescing, one CQE entry can be used to indicate multiple packets
on the receive queue. This saves processing time and PCI bandwidth over
the CQ.

The MANA Ethernet driver also uses the v2 version of the protocol. It
doesn't use RX coalescing and its behavior is not changed.

Link: https://lore.kernel.org/r/1684045095-31228-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 12:52:01 -03:00
Haiyang Zhang
1919b39fc6 net: mana: Fix perf regression: remove rx_cqes, tx_cqes counters
The apc->eth_stats.rx_cqes is one per NIC (vport), and it's on the
frequent and parallel code path of all queues. So, r/w into this
single shared variable by many threads on different CPUs creates a
lot caching and memory overhead, hence perf regression. And, it's
not accurate due to the high volume concurrent r/w.

For example, a workload is iperf with 128 threads, and with RPS
enabled. We saw perf regression of 25% with the previous patch
adding the counters. And this patch eliminates the regression.

Since the error path of mana_poll_rx_cq() already has warnings, so
keeping the counter and convert it to a per-queue variable is not
necessary. So, just remove this counter from this high frequency
code path.

Also, remove the tx_cqes counter for the same reason. We have
warnings & other counters for errors on that path, and don't need
to count every normal cqe processing.

Cc: stable@vger.kernel.org
Fixes: bd7fc6e195 ("net: mana: Add new MANA VF performance counters for easier troubleshooting")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1685115537-31675-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-30 12:05:22 +02:00
Haiyang Zhang
df18f2da30 net: mana: Check if netdev/napi_alloc_frag returns single page
netdev/napi_alloc_frag() may fall back to single page which is smaller
than the requested size.
Add error checking to avoid memory overwritten.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24 18:08:34 -07:00
Haiyang Zhang
5c74064f43 net: mana: Rename mana_refill_rxoob and remove some empty lines
Rename mana_refill_rxoob for naming consistency.
And remove some empty lines between function call and error
checking.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24 18:08:34 -07:00
Haiyang Zhang
80f6215b45 net: mana: Add support for jumbo frame
During probe, get the hardware-allowed max MTU by querying the device
configuration. Users can select MTU up to the device limit.
When XDP is in use, limit MTU settings so the buffer size is within
one page. And, when MTU is set to a too large value, XDP is not allowed
to run.
Also, to prevent changing MTU fails, and leaves the NIC in a bad state,
pre-allocate all buffers before starting the change. So in low memory
condition, it will return error, without affecting the NIC.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Haiyang Zhang
2fbbd712ba net: mana: Enable RX path to handle various MTU sizes
Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00