Emergency 'epmd' recovery

Suppose you’re running an Erlang application (ejabberd, for instance). It’s been up for months and months but then you try to use a remote control script (like ejabberdctl) and it fails, probably saying things like “nodedown”, indicating it cannot communicate with the Erlang node; yet the application itself is apparently running just fine. Running out of ideas, you run epmd -names and, to your horror, it shows an empty list.

Every Erlang node uses a high port for communication, and epmd’s task is to map symbolic names like ejabberd to port numbers. (A bit like what DNS does for IP addresses, yes.) Without this information nodes cannot find each other. It’s quite interesting how epmd could just lose it, but right now it’s more important to find out how you can restore it without restarting your application. (Oh, the beautiful uptime!)
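To make the name-to-port mapping concrete, here is a minimal sketch of asking epmd for a node's port from an Erlang shell. It assumes the node name ejabberd and that epmd runs on the local host; both are example values. In the situation described above, the lookup would be expected to come back empty.

```erlang
%% Run in any erl shell on the same host.
%% "ejabberd" and "localhost" are example values; substitute your own.
case erl_epmd:port_please("ejabberd", "localhost") of
    {port, Port, _DistVersion} ->
        io:format("ejabberd is registered on port ~p~n", [Port]);
    noport ->
        io:format("epmd has no entry for ejabberd~n");
    Other ->
        io:format("lookup failed: ~p~n", [Other])
end.
```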

There are bits of information on the net that say epmd will gladly register any unused name with any port that has an Erlang node attached when it’s told to do so; we just need a way to send the right packet. It turns out the erl_epmd:register_node/2 function does exactly what we need. Use netstat to find the port your orphan node is listening on, then start an anonymous Erlang node:

erl

It must be anonymous because a node can register with epmd only once, and a named node does that automatically at startup under its own name. We need to do the registration by hand instead.

erl_epmd:start().

Since the node is anonymous, we must start the gen_server ourselves.

erl_epmd:register_node(ejabberd, 23456).

…assuming these are the node name and the port you need. Voilà! Check epmd -names and rejo…

But not quite yet. The thing about epmd is, although it registers whatever it’s told without asking any questions, it will also keep track of the connection that issued the registration request. The moment you disconnect your anonymous node, the restored registration will be gone again. We need to find a way to sneak into the application node and take control from there.
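To see why the registration lives and dies with the connection, it may help to look at roughly what goes over the wire. The following is only an illustrative sketch of the registration packet (ALIVE2_REQ in the epmd protocol), not the method used in this recovery, which relies on erl_epmd:register_node/2. The module name epmd_sketch and the function names are made up for illustration, and the distribution version numbers are an assumption.

```erlang
-module(epmd_sketch).
-export([alive2_req/2, hold_registration/2]).

%% Build an ALIVE2_REQ packet: a 2-byte length followed by the request body.
alive2_req(Name, Port) when is_list(Name), is_integer(Port) ->
    NameBin = list_to_binary(Name),
    Body = <<120,                        % ALIVE2_REQ tag
             Port:16,                    % port the node listens on
             77,                         % node type: normal (visible) node
             0,                          % protocol: TCP/IPv4
             5:16, 5:16,                 % highest/lowest distribution version (assumed)
             (byte_size(NameBin)):16, NameBin/binary,
             0:16>>,                     % no extra data
    <<(byte_size(Body)):16, Body/binary>>.

%% Send the request to epmd on its default port, 4369, and hand back the
%% open socket. The registration lasts only as long as this socket does.
hold_registration(Name, Port) ->
    {ok, Sock} = gen_tcp:connect("localhost", 4369, [binary, {active, false}]),
    ok = gen_tcp:send(Sock, alive2_req(Name, Port)),
    case gen_tcp:recv(Sock, 0, 5000) of
        {ok, <<Tag, 0, _Creation/binary>>} when Tag =:= 121; Tag =:= 118 ->
            {ok, Sock};                  % ALIVE2_RESP or ALIVE2_X_RESP, result 0 = ok
        Other ->
            gen_tcp:close(Sock),
            {error, Other}
    end.
```

Calling epmd_sketch:hold_registration("ejabberd", 23456) and later closing the returned socket with gen_tcp:close/1 should make the name appear in and then vanish from epmd -names, which is the same thing that happens when the anonymous node exits.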

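Start one more node, this time with a name. It needs the same cookie as the application node (pass it with -setcookie if it isn’t picked up automatically), or the ping below will return pang: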

erl -sname repair

Ping the application node to establish the connection:

net_adm:ping(ejabberd@hostname).

The node will ask epmd “who’s ejabberd?” and receive the port we gave it a moment ago, so all is fine. Now, the thing about Erlang connections is that once they’re established by whatever means, including net_adm:ping/1, you don’t need epmd to talk to that node anymore. So now you can turn to the first anonymous node and kill it:

halt().

This will, among other things, break the connection to epmd and make the name ejabberd available for re-registration. Note that the connection we made from the second node is still very much alive, and it’s about time we used it:

rpc:call(ejabberd@hostname, erl_epmd, register_node, [ejabberd, 23456]).

Note that this is essentially the same call to erl_epmd:register_node/2 as before, but this time we do it on behalf of the remote node using rpc:call. This means that the node name is now associated with its rightful owner once again! You can stop the second node now:

halt().

Run epmd -names once more to make sure, and go on with your business.
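If you would rather check from Erlang, the sketch below is a rough equivalent of epmd -names that can be run in any erl shell on the same host. The node name ejabberd is the example value used throughout; adjust it to your own.

```erlang
%% List what the local epmd currently knows and look for our node.
case net_adm:names() of
    {ok, Names} ->
        case lists:keyfind("ejabberd", 1, Names) of
            {"ejabberd", Port} ->
                io:format("ejabberd is registered on port ~p~n", [Port]);
            false ->
                io:format("ejabberd is still missing from epmd~n")
        end;
    {error, Reason} ->
        io:format("could not reach epmd: ~p~n", [Reason])
end.
```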