ElastiFlow OpenSearch Output Retry Mechanism Not Working

Dear ElastiFlow Team,

I have installed ElastiFlow version 7.3.2 and configured my output as an OpenSearch cluster with 3 nodes, each running version 2.15.0.

When my OpenSearch cluster was down for 2 hours and came back up after that gap, I was able to see the data that ElastiFlow had received during the outage.

I checked the logs, which suggest that ElastiFlow has an in-memory buffer where it stores the messages that failed to be sent to OpenSearch.

#EF_OUTPUT_OPENSEARCH_ADDRESSES: 127.0.0.1:9200
#EF_OUTPUT_OPENSEARCH_ALLOWED_RECORD_TYPES: as_path_hop,flow_option,flow,ifa_hop,telemetry,metric
#EF_OUTPUT_OPENSEARCH_AWS_ACCESS_KEY: ""
#EF_OUTPUT_OPENSEARCH_AWS_REGION: ""
#EF_OUTPUT_OPENSEARCH_AWS_SECRET_KEY: ""
#EF_OUTPUT_OPENSEARCH_BATCH_DEADLINE: 2000
#EF_OUTPUT_OPENSEARCH_BATCH_MAX_BYTES: 8388608
#EF_OUTPUT_OPENSEARCH_CLIENT_CA_CERT_FILEPATH: ""
#EF_OUTPUT_OPENSEARCH_CLIENT_CERT_FILEPATH: ""
#EF_OUTPUT_OPENSEARCH_CLIENT_KEY_FILEPATH: ""
#EF_OUTPUT_OPENSEARCH_DROP_FIELDS: ""
#EF_OUTPUT_OPENSEARCH_ECS_ENABLE: "false"
#EF_OUTPUT_OPENSEARCH_ENABLE: "false"
#EF_OUTPUT_OPENSEARCH_INDEX_PERIOD: daily
#EF_OUTPUT_OPENSEARCH_INDEX_SUFFIX: ""
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_CODEC: best_compression
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_ENABLE: "true"
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_ISM_POLICY: elastiflow
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_OVERWRITE: "true"
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_PIPELINE_DEFAULT: _none
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_PIPELINE_FINAL: _none
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_REFRESH_INTERVAL: 10s
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_REPLICAS: 1
#EF_OUTPUT_OPENSEARCH_INDEX_TEMPLATE_SHARDS: 3
#EF_OUTPUT_OPENSEARCH_MAX_RETRIES: 3
#EF_OUTPUT_OPENSEARCH_PASSWORD: admin
#EF_OUTPUT_OPENSEARCH_RETRY_BACKOFF: 1000
#EF_OUTPUT_OPENSEARCH_RETRY_ENABLE: "true"
#EF_OUTPUT_OPENSEARCH_RETRY_ON_TIMEOUT_ENABLE: "true"
#EF_OUTPUT_OPENSEARCH_TIMESTAMP_SOURCE: collect
#EF_OUTPUT_OPENSEARCH_TLS_CA_CERT_FILEPATH: ""
#EF_OUTPUT_OPENSEARCH_TLS_ENABLE: "false"
#EF_OUTPUT_OPENSEARCH_TLS_SKIP_VERIFICATION: "false"
#EF_OUTPUT_OPENSEARCH_USERNAME: admin

The documentation clearly states that ElastiFlow retries timed-out bulk index requests up to the maximum, i.e. "EF_OUTPUT_OPENSEARCH_MAX_RETRIES: 3" as configured here, with the retry backoff configured to 1 second.
Note: ElastiFlow never stops, even after the retry count has crossed the max retry limit.
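
To make my expectation concrete, here is a minimal Go sketch of how a bounded retry with a fixed backoff would behave with these settings. This is only an illustration of the documented parameters, not the actual flowcoll implementation:

// Illustrative only: a bounded retry loop with fixed backoff,
// mirroring MAX_RETRIES=3 and RETRY_BACKOFF=1000ms. Not flowcoll code.
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxRetries   = 3                       // EF_OUTPUT_OPENSEARCH_MAX_RETRIES
	retryBackoff = 1000 * time.Millisecond // EF_OUTPUT_OPENSEARCH_RETRY_BACKOFF
)

// sendBulk stands in for a bulk index request; always failing here
// simulates OpenSearch being down.
func sendBulk() error { return errors.New("connection refused") }

func sendWithRetry() error {
	var err error
	// One initial attempt plus up to maxRetries retries, which matches
	// the attemptCount values of 2, 3, 4 seen in the logs below.
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = sendBulk(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v\n", attempt+1, err)
		time.Sleep(retryBackoff)
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	if err := sendWithRetry(); err != nil {
		fmt.Println(err)
	}
}

The open question is what happens to a batch after the final retry fails.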

With the above configuration, the data received by ElastiFlow during the outage should ideally have been discarded, since each batch should have been retried the maximum number of times while OpenSearch was unavailable.

I have a few questions:

  1. How do I control the in-memory buffer settings?
  2. Why is the retry mechanism not working properly, or is my configuration wrong?

Please do help me out. Thanks.

That is certainly unexpected behavior. As far as I know there is no “buffer” for data to be sent to the output, particularly one that retains data for 2 hours. We queue the data to be sent to the output, but AFAIK we would not “buffer” 2 hours of data.
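
For illustration, here is a minimal Go sketch of the distinction I mean between queueing and buffering; the bounded channel and drop-when-full policy are assumptions for the sketch, not a description of the collector's internals:

// Illustrative only: a small, bounded queue in front of an output
// worker. Capacity and drop policy are assumptions.
package main

import "fmt"

func main() {
	queue := make(chan string, 4) // assumed small, bounded capacity

	// Producer: when the queue is full, a new record is dropped rather
	// than retained indefinitely.
	for i := 0; i < 6; i++ {
		record := fmt.Sprintf("flow-%d", i)
		select {
		case queue <- record:
			fmt.Println("queued:", record)
		default:
			fmt.Println("dropped (queue full):", record)
		}
	}
	close(queue)

	// Consumer: an output worker would drain this toward OpenSearch.
	for record := range queue {
		fmt.Println("sending:", record)
	}
}

The point of the sketch is that a queue like this holds only a bounded amount of in-flight data, which is very different from buffering 2 hours' worth of records.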

You say the OpenSearch cluster has 3 nodes. Were all 3 nodes down for the full 2 hours?

Also, all of the OpenSearch configuration you show is commented out, so what is the actual configuration? Did you see anything in the logs to indicate that you had lost connection to OpenSearch?

Regards,
Dexter

Thank you for the speedy reply, really appreciate it.

Firstly, regarding the ElastiFlow configuration: I printed the values to the console, which is why the hash prefixes appear. Except for the hosts, I did not override any of the default values.

Secondly, I have configured OpenSearch such that if 2 nodes are down, the cluster is down. I tried both scenarios, with 2 nodes down and with 3 nodes down, but in both cases I was still able to see the data from the period when OpenSearch was down.

I have some questions:
1. How can I control the queue size?
2. Where is the data to be retried stored? Is it in that same queue?
3. I could not see any log information about the data being discarded after the maximum retries; could you suggest any configuration I can enable to get that info as well?

In my test environment, as soon as I stop OpenSearch, I see the following messages in the log file. Note that there are 3 “retrying request” messages before the failed connection is logged again.

2025-02-05T17:04:08.348Z	info	flowcoll.monitor_pool	monitor/pool.go:53	Monitor Output: decoding rate: 1090 records/second
2025-02-05T17:06:08.353Z	info	flowcoll.monitor_pool	monitor/pool.go:53	Monitor Output: decoding rate: 1087 records/second
2025-02-05T17:06:54.416Z	info	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpretry/httpretry.go:70	retrying request	{"attemptCount": 2, "address": "https://10.101.3.22:9200/_bulk"}
2025-02-05T17:06:54.618Z	info	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpretry/httpretry.go:70	retrying request	{"attemptCount": 3, "address": "https://10.101.3.22:9200/_bulk"}
2025-02-05T17:06:54.821Z	info	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpretry/httpretry.go:70	retrying request	{"attemptCount": 4, "address": "https://10.101.3.22:9200/_bulk"}
2025-02-05T17:06:54.822Z	error	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpoutput/conn_worker.go:101	error while sending request: error sending request: check response failure: Post "https://10.101.3.22:9200/_bulk": dial tcp 10.101.3.22:9200: connect: connection refused
github.com/elastiflow/flowcoll/pkg/outputs/utils/httpoutput.(*ConnWorker).Run
	/tmp/collectors/pkg/outputs/utils/httpoutput/conn_worker.go:101
2025-02-05T17:06:54.822Z	info	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpoutput/conn_worker.go:105	connection marked as dead{"address": "10.101.3.22:9200"}
2025-02-05T17:06:54.822Z	info	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpoutput/conn_worker.go:79	no alive connections available
2025-02-05T17:06:57.824Z	info	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpoutput/conn_worker.go:79	no alive connections available
2025-02-05T17:06:58.705Z	info	flowcoll.opensearch_output[default]	httpoutput/healthcheck_runner.go:50	attempting healthcheck	{"address": "10.101.3.22:9200"}
2025-02-05T17:06:58.707Z	error	flowcoll.opensearch_output[default]	httpoutput/healthcheck_runner.go:63	healthcheck failed; connection is unavailable	{"address": "10.101.3.22:9200", "error": "error while performing request: Get \"https://10.101.3.22:9200\": dial tcp 10.101.3.22:9200: connect: connection refused"}
github.com/elastiflow/flowcoll/pkg/outputs/utils/httpoutput.(*HealthCheckRunner).RunHealthCheck
	/tmp/collectors/pkg/outputs/utils/httpoutput/healthcheck_runner.go:63
2025-02-05T17:07:00.825Z	info	flowcoll.opensearch_output[default].http_connection_manager.connection_worker[0]	httpoutput/conn_worker.go:79	no alive connections available

I stopped OpenSearch at approximately 17:07:00. While it was down, the flow collector was running and I was sending test packets at a rate of about 1000 per second. I restarted OpenSearch at 17:14:30 and see these messages in the flow collector log:

2025-02-05T17:14:48.985Z	info	flowcoll.opensearch_output[default]	httpoutput/healthcheck_runner.go:50	attempting healthcheck	{"address": "10.101.3.22:9200"}
2025-02-05T17:14:49.330Z	info	flowcoll.opensearch_output[default]	httpoutput/healthcheck_runner.go:70	healthcheck success; connection is available	{"address": "10.101.3.22:9200"}
2025-02-05T17:16:08.348Z	info	flowcoll.monitor_pool	monitor/pool.go:53	Monitor Output: decoding rate: 763 records/second

Note that I only log the decoding rate every 2 minutes.
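
The pattern in these logs is: the connection worker retries, marks the connection dead once retries are exhausted, and a healthcheck runner then probes the address periodically until it succeeds and the connection is usable again. Here is a minimal Go sketch of that cycle, with the probe interval and plain-HTTP probe as assumptions rather than the collector's actual behavior:

// Illustrative only: a periodic healthcheck reviving a dead connection.
// The interval and probe endpoint are assumptions for the sketch.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func healthcheck(address string) error {
	// Probe the node root; plain HTTP here to keep the sketch
	// self-contained (my test cluster actually uses HTTPS).
	resp, err := http.Get("http://" + address)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	address := "10.101.3.22:9200"             // address from the logs above
	ticker := time.NewTicker(4 * time.Second) // assumed probe interval
	defer ticker.Stop()

	// Loop until a probe succeeds; failures repeat indefinitely,
	// as in the "no alive connections available" messages above.
	for range ticker.C {
		fmt.Println("attempting healthcheck:", address)
		if err := healthcheck(address); err != nil {
			fmt.Println("healthcheck failed; connection is unavailable:", err)
			continue
		}
		fmt.Println("healthcheck success; connection is available")
		break // the connection would be marked alive again here
	}
}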

Attached is a screenshot of the flow records data from the “outage” period. It has a several-minute gap, as expected. Note that this is a single-node setup with only one flow collector running.

I can’t replicate the behavior you are describing, so I’m not sure how to advise further, but I hope the log messages and screenshot provide some insight into the issue for you.

Regards,
Dexter

I switched off OpenSearch at 9:04 am and switched it back on at 10:04 am, but I could still see the data from that period.
I have two questions:

  1. When you were sending 1000 flows/s, did you observe data being discarded after 3 retries in the logs?
  2. Which version of ElastiFlow were you using?

I am running the ElastiFlow 7.3.2 Docker image in a Kubernetes cluster.
I was able to see the same log messages as yours, i.e. the retrying attempts etc.

I was also getting log messages related to the buffer. Did you see those as well?

As you can clearly see, continuous buffer messages are picked up as soon as OpenSearch is available again; I believe that is where the data was stored instead of being discarded.

I’m running ElastiFlow 7.6.0 on a native Ubuntu installation. I do not see “buffer” messages, probably because I am not running with the ‘debug’ log level.

My observation was that no data was received in OpenSearch while my single node was down. I don’t have any indication in the logs that data was discarded; I just know it did not arrive at OpenSearch.

At such low volumes as you are showing, I guess it might be possible that the records stay “in the queue” waiting to be sent rather than being discarded, but I can’t validate that assumption.
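
As a back-of-the-envelope illustration of that guess (all numbers here are assumptions, since I don't know the queue's actual capacity): at a low ingest rate, a couple of hours of records could plausibly fit in a bounded queue and simply drain once the connection is healthy again.

// Illustrative arithmetic only: the rate and queue capacity are
// assumed values, not flowcoll internals.
package main

import "fmt"

func main() {
	ratePerSec := 100        // assumed low ingest rate, records/second
	outageSec := 2 * 60 * 60 // the 2-hour outage described above
	queueCap := 1_000_000    // assumed queue capacity, in records

	backlog := ratePerSec * outageSec // records accumulated during the outage
	fmt.Printf("backlog: %d records; fits in queue: %v\n", backlog, backlog <= queueCap)
	// With these assumed numbers the whole outage fits in the queue,
	// which would explain the data appearing after recovery.
}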