Hi everyone,
I am facing a massive data discrepancy issue in my ElastiFlow/Kibana setup (running on AlmaLinux) regarding sFlow data from my Extreme Networks VOSS / Fabric Engine switches.
The Problem:
When performing a large file transfer (approx. 6.24 GB via Windows SMB) from a File Server to a Client PC, Windows consistently shows a transfer rate of 102 MB/s (utilizing the full 1G link). The transfer completes in just over a minute.
However, looking at the ElastiFlow Dashboards (specifically the “Top Talkers” and “Flows” Sankey charts) during that exact time window (filtered down to the “Last 15 minutes”), the total volume for this client/server connection caps out stubbornly at 1.4 GB.
The Linux server running the ElastiFlow collector 7.20 shows 0 packet receive errors and 0 receive buffer errors via netstat -su, meaning the OS is not dropping any incoming sFlow UDP packets. Furthermore, checking show qos cosq-stats cpu-port on the edge switch confirms 0 active packet drops on the CPU queues during the transfer.
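For anyone wanting to repeat this check around a single test transfer, here is a minimal Python sketch that snapshots the kernel's UDP counters before and after the transfer and reports the difference. It assumes a Linux host where these counters are exposed in /proc/net/snmp; the function names are made up for illustration and are not part of ElastiFlow.

```python
from typing import Dict


def parse_udp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp-style text."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = udp_lines[0], udp_lines[1]
    # Skip the leading "Udp:" token on both lines.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def read_udp_counters(path: str = "/proc/net/snmp") -> Dict[str, int]:
    """Read the current UDP counters from the kernel (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return parse_udp_counters(handle.read())


def udp_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}
```

Call read_udp_counters() just before starting the copy and again right after it finishes, then look at the InErrors and RcvbufErrors entries of udp_delta(before, after). Both staying at 0 is consistent with the collector host not dropping sFlow datagrams during that window.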
My Switch Configuration (Extreme VOSS):
Has anyone experienced this specific behavior where sFlow traffic gets scaled down or capped at a specific fraction (1.4 GB instead of 6.24 GB) in ElastiFlow?
Any help or guidance on how to troubleshoot this would be highly appreciated.
Thanks in advance!
Do you see lower than expected sizes across all devices, or is it limited to specific devices / device types?
I would recommend checking the following:
- In the Flow Records dashboard, filter to your impacted device, and then expand one of the flow records. Verify whether the sample rate fields are populated with the correct sample rate. In the Table view of the data you can search for fields.
- Check your configuration to see whether the sample rate is being overwritten with a custom rate (see "Sample Rate Adjustments" in the ElastiFlow documentation).
- Verify if the packet counts seem accurate compared to the device packet counts or packets/s
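As a rough aid for the packet-count comparison, the following Python sketch estimates how many packets and how many sFlow samples a transfer of a given size should produce. The average frame size is an assumption (large SMB transfers are mostly full-size frames, but the real average will differ), and the function name is only illustrative.

```python
def expected_counts(total_bytes: float, avg_frame_bytes: float, sample_rate: int):
    """Estimate (packets, samples) for a transfer.

    total_bytes     -- size of the transfer in bytes
    avg_frame_bytes -- assumed average frame size on the wire
    sample_rate     -- 1-in-N packet sampling rate configured on the exporter
    """
    packets = total_bytes / avg_frame_bytes
    samples = packets / sample_rate
    return packets, samples


# Example: 6.24 GB transfer, assumed 1500-byte average frame, 1:1024 sampling.
packets, samples = expected_counts(6.24e9, 1500, 1024)
print(f"expected packets: {packets:,.0f}, expected samples: {samples:,.1f}")
```

With those assumed inputs the estimate is about 4.16 million packets and roughly 4,000 samples for the data direction. If the summed packet count in the dashboard for the same time window is far below the device's own port counters, that points at the sample rate or at missing samples rather than at the byte calculation.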
Hi Daniel,
Thank you for the recommendations. Here are the step-by-step verification results based on your questions:
1. Scope of the Issue
The issue is not limited to a specific device, but rather affects the entire traffic path. Both the Edge switch (1G client ports) and the Core switches (25G server ports) show underreported traffic volumes inside the dashboards, sticking around the same ~1.4 GB mark for a 6.24 GB transfer.
2. Sample Rate Verification in Flow Records & Custom Config
I checked the Flow Records dashboard, expanded the raw documents, and verified the configurations:
- The flow.in.packets field: in my ElastiFlow version it shows a value of 1,024, which is correct in my case.
- Custom configuration: to rule out any detection issues, I explicitly enabled the user-defined sample rate enrichment in flowcoll.yaml:
EF_PROCESSOR_ENRICH_SAMPLERATE_USERDEF_ENABLE: "true"
EF_PROCESSOR_ENRICH_SAMPLERATE_USERDEF_OVERRIDE: "true"
EF_PROCESSOR_ENRICH_SAMPLERATE_USERDEF_PATH: /etc/elastiflow/settings/sample_rate.yml
Inside sample_rate.yml, I manually hardcoded the sampling rates for the respective flow exporters (1024 for the Edge and 4096 for the Cores). After restarting the collector and re-running the 6.24 GB transfer, the dashboard still capped out at exactly 1.4 GB.
Any other idea what could be wrong?
So the flow.in.packets field should be the total number of packets that the sFlow record has details for, scaled by the sampling rate. The sampling rate should be shown in flow.meter.packet_select.interval.packets.
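To illustrate the scaling described above, here is a small Python sketch of how sampled records are typically turned into estimated totals: each sample stands in for sample_rate packets of its frame length. This is a simplified model for reasoning about the numbers, not ElastiFlow's actual code, and the FlowSample type is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class FlowSample:
    frame_length: int  # length of the sampled frame in bytes
    sample_rate: int   # 1-in-N sampling rate reported for this sample


def estimate_totals(samples: Iterable[FlowSample]) -> Tuple[int, int]:
    """Scale each sample by its rate and return (packets, bytes)."""
    packets = 0
    byte_total = 0
    for sample in samples:
        # One sample represents roughly sample_rate packets of this size.
        packets += sample.sample_rate
        byte_total += sample.frame_length * sample.sample_rate
    return packets, byte_total
```

Under this model a record built from a single sample at 1:1024 reports 1,024 packets, so a flow.in.packets value of 1,024 on its own does not show how many samples actually arrived during the transfer.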
To get the total number of bytes from the sFlow records we’re using the frame length, which can get truncated. This documentation indicates that the default maximum header size is 128 bytes, which could result in truncated packets and inaccurate byte counts. Can you try bumping this up to 256 bytes to see whether your total reported byte size changes?
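One way to sanity-check the truncation theory before and after changing the header size is to work out how many bytes per packet the dashboard is effectively counting. The Python sketch below does that arithmetic; the average frame size is an assumption, and both helper names are illustrative only.

```python
def implied_bytes_per_packet(reported_bytes: float, actual_bytes: float,
                             avg_frame_bytes: float) -> float:
    """Bytes per packet the collector appears to count, given the shortfall."""
    return reported_bytes / actual_bytes * avg_frame_bytes


def truncated_fraction(max_header_bytes: int, avg_frame_bytes: float) -> float:
    """Share of the real volume reported if every frame were cut to the header size."""
    return min(max_header_bytes, avg_frame_bytes) / avg_frame_bytes


# Figures from this thread, with an assumed 1518-byte average frame.
per_packet = implied_bytes_per_packet(1.4, 6.24, 1518)
print(f"implied bytes counted per packet: {per_packet:.0f}")
for header in (128, 256):
    print(f"header {header} B -> fraction reported: {truncated_fraction(header, 1518):.3f}")
```

If the reported total scales with the configured header size after the change, truncation is a plausible explanation. If it stays near the same value, the sample rate and the number of samples received are probably worth rechecking.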