Our MongoDB driver creates a new temporary connection, for every active
connection, to just do a single heartbeat test. There is configurable
delay between every heartbeat test. When the user has an EMQX cluster
with a MongoDB bridge (to a MongoDB replica set), there will be a lot of
connections. Furthermore, as MongoDB creates a log entry every time a
new connection is created, the log will be flooded with info about new
connection. One user have reported more than 1MB of log data in a 10
minute period.
This commit tries to fix this by increasing the default delay between
heartbeats. A better fix would be to change the MongoDB driver so that
it does not create a new connection just to do a heartbeat check, but
this is more complicated so we leave this to the future. We might also
swap out the current MongoDB driver to something better.
Fixes:
https://github.com/emqx/emqx/issues/9851
This fixes a crash with an error in the log file (see below) that
happened when the MongoDB authorization module queried the database. The
reason is that the collection name that was sent to the mongodb
connection was an atom. This is fixed by making sure it is not an atom.
2023-03-08T17:16:34.215523+01:00 [error] msg: query_mongo_error, mfa:
emqx_authz_mongodb:authorize/4, line: 95, peername: 127.0.0.1:53212,
clientid: client123, collection: mqtt_acl, filter: #{username =>
<<"emqx_u">>}, reason: {resource_error,#{msg => #{error =>
{error,{error_cannot_parse_response,{op_msg_response,#{<<"code">> =>
73,<<"codeName">> => <<"InvalidNamespace">>,<<"errmsg">> => <<"Failed to
parse namespace element">>,<<"ok">> => 0.0}}}},id =>
<<"emqx_authz_mongodb:3">>,name => call_query,request =>
{find,mqtt_acl,#{username => <<"emqx_u">>},#{}},stacktrace =>
[{mc_connection_man,reply,1,[{file,"mc_connection_man.erl"},{line,123}],
...]}, reason => exception}}, resource_id: <<"emqx_authz_mongodb:3">>
Fixes: https://github.com/emqx/emqx/issues/9783
Prior to this change, when EMQX daemon mode failed to start
it's not quite easy for users to understand what went wrong.
All the know is the node did not start in time
and then instructed to boot the node in 'console' mode wishing
for some logs.
However, the node might actuay be running, causing 'console' mode
to fail with a different reason.
With this change, after a filure of daemon mode boot,
we issue a diagnosis.
1. if node can not be found from ps -ef, instruct the user
to find information in erlang.log.N
2. if the node is found running, but not responding to pings
instruct the user to check if the node name is
resolvable and reachable
3. if the node is responding to pings but emqx app is not
running, then it's likely a bug. so the user is advised
to report a github issue.