How to configure a ZeroMQ ROUTER socket to keep trying to send messages for an extended period of time?



I have a ZeroMQ ROUTER/DEALER socket pair, used for asynchronous communication between two computers.

If the computer with the DEALER socket goes off-line for a while and comes back, some messages are lost.

I do understand that ZeroMQ can't hold the messages indefinitely as there is no guarantee the DEALER-side is ever coming back. I am looking for ways to configure this behavior - is there a setting I can use to control how long the messages are kept before giving up?

What settings might affect this behavior?

I don't think the issue is related to the High Water Mark setting, as the amount of data transferred is quite low.

ZeroMQ version 4.0.4 on Windows.

I'm not sure exactly which parts of the code would be relevant to show. Also, it is not entirely straightforward to lift things out of context and keep them understandable. I'll try, so here goes.

This is how the router socket is initialized:

 router-socket (doto (zmq/socket zmq-context :router)
                 (zmq/set-receive-timeout 100) ; receive timeout, in ms
                 (zmq/set-recv-hwm 0)          ; 0 = no high-water-mark limit
                 (zmq/set-send-hwm 0)          ; 0 = no high-water-mark limit
                 (zmq/bind (str "tcp://*:" (:port request-handler))))

Messages are sent with the zmq/send function.

Even with no Clojure experience, the ZeroMQ parts should be clear.

Here's how the dealer is initialized (C#):

var dealer = context.CreateSocket( SocketType.DEALER );
dealer.SendHighWatermark = 0;
dealer.ReceiveHighWatermark = 0;    

Reception of the messages uses the ZmqSocket.ReceiveMessage method (well, actually it's an extension method in the SendReceiveExtensions class, but anyway).

One (possibly the main) case where the message loss occurs is when the computer running the dealer goes to sleep (i.e. the laptop lid is closed) and is woken up some minutes afterwards. As I stated in my original question, I am assuming that the messages are lost because the router temporarily gives up on the dealer coming back and therefore discards them. But this is only an assumption; the cause may be something else, too.

Asked by Antti Karanta on 27 December 2017

Answers (1)



So, let's start with some polishing touches on the socket-level settings, applied through the .setsockopt() method, which steer the low-level work that the ZeroMQ Context()-engine does for the smart signaling / messaging.

In real-world troubleshooting there is no one-size-fits-all answer, so with only fragments of the code available, many things can only be approached by guesstimate and by experience from past troubles.

Because some of the root causes may get masked by other habits of the ZeroMQ processes running under the hood, the following text is a broad view of the art of balancing acts rather than a step-by-step navigation.

Going by just the few remarks above, I would start with these suspects from the long list of smart API options:

ZMQ_RECONNECT_IVL: Set reconnection interval

The ZMQ_RECONNECT_IVL option shall set the initial reconnection interval for the specified socket. The reconnection interval is the period ØMQ shall wait between attempts to reconnect disconnected peers when using connection-oriented transports.

Going shorter, from the default ~ 100 [ms] to some 2 [ms], with ZMQ_RECONNECT_IVL_MAX adjusted to a few multiples thereof, may, together with the strategies mentioned below for surviving a spurious loss of signal (LoS) and similar service dropouts, reduce the latency overhead of renewing the lost low-level connections.

See also ZMQ_TCP_MAXRT and the O/S overrides ( available where the O/S supports them ) via ZMQ_TCP_KEEPALIVE_{CNT | IDLE | INTVL}.
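To get a feel for how ZMQ_RECONNECT_IVL and ZMQ_RECONNECT_IVL_MAX interact, here is a minimal sketch of the doubling rule as the API documentation describes it (each attempt doubles the previous interval until the maximum is reached, and a maximum below the base interval is ignored). The function name `reconnect_schedule` is made up for this illustration, and the model ignores any randomisation the library may add on top, so treat the numbers as approximate.

```python
def reconnect_schedule(ivl_ms, ivl_max_ms, attempts):
    """Return the wait (in ms) before each of the first `attempts` reconnects.

    Hypothetical helper modelling only the documented doubling rule;
    it is not part of any ZeroMQ binding.
    """
    waits = []
    current = ivl_ms
    for _ in range(attempts):
        waits.append(current)
        # Back-off applies only when the maximum is set above the base interval.
        if ivl_max_ms > ivl_ms:
            current = min(current * 2, ivl_max_ms)
    return waits


if __name__ == "__main__":
    print(reconnect_schedule(2, 16, 6))    # 2 ms base, capped at 16 ms
    print(reconnect_schedule(100, 0, 4))   # maximum of 0: fixed interval
```

With a 2 [ms] base and a 16 [ms] cap, the first six attempts are spread over roughly 62 [ms], which is less than a single default 100 [ms] interval.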

The next option exposes the states in which peers are not connected, so that the message delivery strategy can be adjusted for such observed cases in the user-application code:

ZMQ_IMMEDIATE: Queue messages only to completed connections

By default queues will fill on outgoing connections even if the connection has not completed. This can lead to "lost" messages on sockets with round-robin routing ( REQ, PUSH, DEALER ). If this option is set to 1, messages shall be queued only to completed connections. This will cause the socket to block if there are no other connections, but will prevent queues from filling on pipes awaiting connection.

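With a similar strategy to the one posted above, a DEALER ( or REQ, or ROUTER ) socket that connects to a ROUTER may use the ZMQ_PROBE_ROUTER setting, so as to bootstrap connections to ROUTER sockets: the connecting socket then automatically sends an empty message on each new connection, which the application on the ROUTER side must filter out.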
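As for adjusting the delivery strategy in user-application code, one possible shape is an application-level outbox on the ROUTER side that keeps each message until the peer acknowledges it or a chosen time-to-live expires. This is only a sketch under stated assumptions: the `Outbox` class, its method names and the acknowledgement scheme are all hypothetical (ZeroMQ provides nothing of the kind), and the actual resending through the ROUTER socket is left out.

```python
import time
from collections import deque


class Outbox:
    """Hold unacknowledged messages per peer, dropping them after `ttl_s` seconds.

    Hypothetical helper for illustration; not part of ZeroMQ.
    """

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock   # injectable so the expiry logic can be tested
        self.pending = {}    # peer identity -> deque of (deadline, message)

    def enqueue(self, peer, message):
        """Remember `message` for `peer` until it is acked or expires."""
        deadline = self.clock() + self.ttl_s
        self.pending.setdefault(peer, deque()).append((deadline, message))

    def due(self, peer):
        """Discard expired messages and return the rest, oldest first."""
        now = self.clock()
        queue = self.pending.get(peer, deque())
        # Deadlines grow monotonically, so expired entries sit at the front.
        while queue and queue[0][0] <= now:
            queue.popleft()
        return [message for _, message in queue]

    def ack(self, peer, message):
        """Forget the first pending copy of `message` once the peer confirms it."""
        queue = self.pending.get(peer)
        if queue is None:
            return
        for item in list(queue):
            if item[1] == message:
                queue.remove(item)
                break
```

The application would call `due(peer)` whenever the peer shows up again (for instance when a probe or any other message from it arrives) and resend whatever is returned; `ttl_s` is then exactly the "how long to keep trying" knob asked about.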

If feasible in principle and still reasonable cost-wise, one may use a sort of "service" sonar-beep, injected at regular intervals:

ZMQ_HEARTBEAT_IVL: Set interval between sending ZMTP heartbeats

The ZMQ_HEARTBEAT_IVL option shall set the interval between sending ZMTP heartbeats for the specified socket. If this option is set and is greater than 0, then a PING ZMTP command will be sent every ZMQ_HEARTBEAT_IVL milliseconds.

ZMQ_HEARTBEAT_TIMEOUT: Set timeout for ZMTP heartbeats

The ZMQ_HEARTBEAT_TIMEOUT option shall set how long to wait before timing-out a connection after sending a PING ZMTP command and not receiving any traffic. This option is only valid if ZMQ_HEARTBEAT_IVL is also set, and is greater than 0. The connection will time out if there is no traffic received after sending the PING command, but the received traffic does not have to be a PONG command - any received traffic will cancel the timeout.

ZMQ_HEARTBEAT_TTL: Set the TTL value for ZMTP heartbeats

The ZMQ_HEARTBEAT_TTL option shall set the timeout on the remote peer for ZMTP heartbeats. If this option is greater than 0, the remote side shall time out the connection if it does not receive any more traffic within the TTL period. This option does not have any effect if ZMQ_HEARTBEAT_IVL is not set or is 0. Internally, this value is rounded down to the nearest decisecond, any value less than 100 will have no effect.
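The semantics quoted above can be modelled in a few lines, which may help when choosing values or when implementing an application-level heartbeat instead. To my knowledge the ZMQ_HEARTBEAT_* options arrived in libzmq releases later than the 4.0.4 named in the question, so their availability should be checked first. The sketch below is not libzmq's implementation; `PeerLiveness` and `effective_ttl_ms` are hypothetical names, and the code only mirrors the documented rules: any inbound traffic cancels the timeout, and the TTL is rounded down to whole deciseconds.

```python
import time


def effective_ttl_ms(ttl_ms):
    """Round a TTL down to whole deciseconds, as the quoted text describes.

    Values below 100 ms come out as 0, i.e. they have no effect.
    """
    return (ttl_ms // 100) * 100


class PeerLiveness:
    """Track PING scheduling and timeout for one peer (illustrative only)."""

    def __init__(self, interval_s, timeout_s, clock=time.monotonic):
        self.interval_s = interval_s
        self.timeout_s = timeout_s
        self.clock = clock
        self.ping_sent_at = None                 # None: no PING outstanding
        self.next_ping_at = clock() + interval_s

    def on_traffic(self):
        """Any inbound frame cancels the timeout, not only a PONG."""
        self.ping_sent_at = None

    def poll(self):
        """Return 'dead' on timeout, 'ping' when a PING is due, else 'ok'."""
        now = self.clock()
        if self.ping_sent_at is not None and now - self.ping_sent_at >= self.timeout_s:
            return "dead"
        if now >= self.next_ping_at:
            self.next_ping_at = now + self.interval_s
            if self.ping_sent_at is None:
                self.ping_sent_at = now          # timeout counts from the first unanswered PING
            return "ping"
        return "ok"
```

For the sleeping-laptop case, a timeout shorter than the typical sleep means the ROUTER side notices the dropout while it happens, and the application can then decide what to do with messages meant for that peer.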

ZMQ_CONNECT_TIMEOUT: Set connect() timeout

Sets how long to wait before timing-out a connect() system call. The connect() system call normally takes a long time before it returns a time out error. Setting this option allows the library to time out the call at an earlier interval.

Setting it to just a few [ms] may help unmask intermittent interruptions and/or open new service windows in between them. For this short-window strategy to work, one ought also to reduce the maximum value permitted by ZMQ_HANDSHAKE_IVL.

ZMQ_BACKLOG: Set maximum length of the queue of outstanding connections

The ZMQ_BACKLOG option shall set the maximum length of the queue of outstanding peer connections for the specified socket; this only applies to connection-oriented transports. For details refer to your operating system documentation for the listen function.

The default of a hundred may serve well enough here, but without details about the number of "lost" connections, let's keep it on the troubleshooters' shopping list.

Answered by user3666197 on 28 December 2017, 05:22