We’ve all had cases where we need to communicate between computers or people electronically. There is a plethora of tools, built on various protocols, that suit the task to varying definitions of ‘well’. I’ve decided to throw my hat into the ring with a really simple protocol built for this purpose.
‘Why would you want to do this?’ you might ask. ‘What’s wrong with X, Y or Z?’ you might also ask. The fact is, for your purposes, other tools may be better. I built this because I believe it to be better for my purposes.
So what are my purposes, you might ask? Well, I’m setting out to build an inherently fault-tolerant local and remote communication protocol. Nothing extremely fancy or clever, just something simple and well suited to the task.
Anyway, enough of that, let’s get into some details.
How does it work?
After deciding what I wanted to build, I started thinking about how I wanted to build it. Almost immediately, traditional peer-to-peer networks came to mind as being reasonably well suited to what I wanted. However, most P2P overlay networks can’t guarantee there’s no data loss when half of the network plus one node disappears. This is a problem for me. I want the network to appear up to every node, even if that node is entirely detached.
I know I could have solved this appearance problem with local caching, but that has the same weakness as Skype’s local cache, for instance: if you go offline before the other user comes online, they won’t get your messages until you’re both online at the same time. This is a problem I’d like to avoid.
The network is structured from the vantage point of a node, as a member of a swarm. A swarm is an independent group of nodes which communicate with one another. The nodes in a swarm do not all necessarily connect to the same node; they can connect to whichever nodes they wish. The swarm is simply the formation of all the nodes, connected however they are.
You can probably infer, at least if you have built such systems in the past, that without a bootstrapping process the ideal operating mode for this network, in terms of latency and ease of finding other nodes, is limited to a local area network, and you’d be right. Locally, Swarm looks for any other Swarm services advertised on the network and connects to those nodes. The ultimate result is a directed acyclic graph (DAG for short) that forms the swarm.
The nodes communicate with one another by multicast (primarily) or unicast (explained later) messaging. Each message is constructed with a given purpose and a payload (if applicable). When a node receives a message, it checks whether it has seen a message with the given UUID before; if so, it ignores the message. Otherwise, it records that it has seen the message, then forwards it to all the nodes it is directly connected to.
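To make the forwarding rule concrete, here is a minimal in-process sketch in Python. It is not Swarm’s actual code: the `Message` and `Node` classes, their field names, and the two-way `connect` helper are all assumptions made for illustration, and real nodes would talk over sockets rather than call each other directly.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Message:
    purpose: str
    payload: bytes = b""
    # Every message carries a unique ID so nodes can recognise repeats.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Node:
    """Hypothetical node: remembers message IDs and floods to its peers."""

    def __init__(self, name):
        self.name = name
        self.peers = []      # nodes this node is directly connected to
        self.seen = set()    # IDs of messages already handled
        self.delivered = []  # messages accepted, in arrival order

    def connect(self, other):
        # Assumption: connections are two-way in this sketch.
        self.peers.append(other)
        other.peers.append(self)

    def receive(self, message):
        if message.id in self.seen:
            return                  # seen before: ignore it
        self.seen.add(message.id)   # record first, then forward
        self.delivered.append(message)
        for peer in self.peers:
            peer.receive(message)


a, b, c = Node("a"), Node("b"), Node("c")
a.connect(b)
b.connect(c)
c.connect(a)  # a loop, to show that repeats are dropped

a.receive(Message(purpose="chat", payload=b"hello"))
print([len(n.delivered) for n in (a, b, c)])  # [1, 1, 1]
```

Recording the ID before forwarding is what stops the message from bouncing around forever: by the time a copy comes back to a node, that node already knows it has seen it.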
Let me stop the explanation at this point and note that this was designed for small to mid-sized swarms, not 10 thousand+ nodes in one swarm. While it’s certainly possible to create very large swarms, that’s not what this was designed for.
What happens at the last node to receive the message is:
- It will forward to all its connected peers;
- Each node that then receives the message, will record it;
- Nodes don’t care if the receiver received the message.
If the intended receiver later connects, it can request a replay of all the messages sent since a given time. It does not have to request the replay from a specific node: since all nodes in the swarm will have received the message, any node will do.
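Replay could be served from a simple per-node log. The sketch below is one hypothetical way to do it, not something Swarm ships: `MessageLog`, `record` and `replay_since` are invented names, and treating "since" as inclusive is my assumption.

```python
import time


class MessageLog:
    """Hypothetical per-node log of every message the node has recorded."""

    def __init__(self):
        self._entries = []  # (recorded_at, message) pairs

    def record(self, message, when=None):
        # Use the current time unless the caller supplies a timestamp.
        when = time.time() if when is None else when
        self._entries.append((when, message))

    def replay_since(self, since):
        # Assumption: "since" is inclusive of the given timestamp.
        return [message for when, message in self._entries if when >= since]


log = MessageLog()
log.record("hello", when=100.0)
log.record("are you there?", when=200.0)

# A node that was offline until t=150 asks any peer for what it missed.
missed = log.replay_since(150.0)
print(missed)  # ['are you there?']
```

Because every node keeps the same kind of log, the returning node can ask whichever peer it happens to connect to first.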
A swarm is inherently multicast. Implementing unicast semantics on top of that is a bit hard given the current way the network works.
To alleviate this problem, we can use external direct connections to retrieve data.
Let’s assume we are building a chat application, and in this chat application someone can drop a YouTube video into the chat. We can recognize this and embed an image of the video in the stream. The receiving node would go out to the internet and fetch this image. Additionally, when the user clicks on the image to start the video, we’d make another connection out to stream it. These connections operate externally to the control network.
While up to this point we’ve just talked about local connections, the same rules apply to remotes. There may be servers in different data centres which need to communicate with one another in the same swarm. A node may also open a socket to one of these remote services on a public listening port to join two swarms together.
This will be a single point of failure, so if either node X or Y goes offline, communications between swarms R and S will be disconnected.
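The single point of failure is easy to see with a small reachability check. This is only an illustration: the node names `r1`, `r2`, `s1`, `s2` and the exact link layout are made up, with `x` and `y` standing in for the two bridging nodes.

```python
from collections import deque


def reachable(links, start, offline=frozenset()):
    """Return the set of nodes reachable from start, skipping offline nodes."""
    if start in offline:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links.get(node, ()):
            if peer not in seen and peer not in offline:
                seen.add(peer)
                queue.append(peer)
    return seen


# Swarm R is r1, r2 and x; swarm S is y, s1 and s2.
# The only link between the two swarms is x <-> y.
links = {
    "r1": ["r2", "x"], "r2": ["r1", "x"], "x": ["r1", "r2", "y"],
    "y": ["x", "s1", "s2"], "s1": ["y", "s2"], "s2": ["y", "s1"],
}

print("s1" in reachable(links, "r1"))                 # True
print("s1" in reachable(links, "r1", offline={"x"}))  # False
print("s1" in reachable(links, "r1", offline={"y"}))  # False
```

Adding a second, independent link between the swarms would remove the single point of failure in this model.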
For my part, I’m considering implementing remotes using DNS-SD so I can use Bonjour both locally and remotely. Additionally, I’m considering using true multicast groups instead of a DAG where possible, once I can explore the possibilities of doing so in a fault-tolerant way.
Swarm is an open source protocol to help connect your nodes, whatever they happen to do, in a fault-tolerant manner.
While message replay is not presently implemented, you can fetch Swarm on my GitHub.