Rocking horse shit, and what it takes to be a distributed systems engineer

Where have all the distributed systems engineers gone? It seems that rocking horse shit and distributed systems engineers are both rare; the difference is that right now we know where to find the horse faeces :)

Take for example a Google image search for “Distributed system engineer”:

The result: a few theories on the problem, our logo, but no engineers out there.

A Google image search for “Rocking horse shit” however produces the following:

Damn! The poop wins out here.


Whilst my image search is in jest, our search for a distributed systems engineer for Ably is not. We’ve learnt a lot from the hundreds of candidates we have spoken to so far for this role, and we’d like to share our thoughts on what it takes to be a distributed systems engineer:

A microservices architecture is not a distributed system

You don’t have experience with distributed systems simply because you’ve used or designed a microservices architecture, or because you have a (partially) horizontally scalable system. Take, for example, this design.

There’s not much “distributed” about this system. There are multiple hosts and network interconnections, but they are tightly coupled, and their network interactions are reliable, low-latency, and predictable. Genuinely distributed, in our view, means systems whose nodes are distributed globally, whose network interactions are unpredictable and can create partitions, but which nonetheless work together to produce a predictable outcome. Distributed systems, at scale, involve state being distributed and rebalanced across the system as nodes are added and removed, in spite of the unpredictability that is inherent in a global system.

An understanding of a hash ring is a prerequisite

If you think a hash ring has something to do with a criminal organization selling cannabis, then we’ll enjoy the chat, but unfortunately you’re probably not ready to be a distributed systems engineer.

Credit: Mathias Meyer from Travis CI

If the above doesn’t look familiar, then I recommend you start diving into how popular distributed systems work, most of which rely on the ideas behind a consistent hash ring. See for example:
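To make the idea concrete, here is a minimal sketch of a consistent hash ring in Python. The node names, virtual-node count and use of MD5 are illustrative choices, not how any particular system implements it: each node is mapped to several points on a ring of hash values, a key is owned by the first node point clockwise from the key’s hash, and removing a node only remaps the keys that node owned.

```python
import bisect
import hashlib

class HashRing:
    """A minimal consistent hash ring sketch. Each node contributes
    several 'virtual node' points on the ring to spread load evenly."""

    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) tuples

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Insert this node's virtual points, keeping the ring sorted.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}:{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def owner(self, key):
        # The key is owned by the first node point at or after its hash,
        # wrapping around to the start of the ring if necessary.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

The key property: when a node is removed, only the keys it owned move to other nodes; everything else stays put, which is what makes rebalancing in an elastic cluster tractable.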

Gossip protocols and consensus algorithms underpin it all

Large distributed systems usually have to track changes in cluster topology in response to network partitions, failures, and scaling events. Various protocols exist to ensure this can happen, with varying levels of consistency and complexity. This needs to be dynamic and real time: nodes come and go in elastic systems, failures need to be detected quickly, and load and state need to be rebalanced in real time. With a stateful system like Ably, state additionally needs to be moved in real time from old nodes to new ones whilst providing continuity throughout.

If you have never worked with gossip protocols or consensus algorithms, then I recommend you read up on:
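As a taste of the core idea, here is a toy simulation of push-pull gossip over in-memory views (this is a sketch of the dissemination mechanism only, not a real protocol like SWIM, and the node names and heartbeat scheme are made up for illustration). Each node starts knowing only its own heartbeat; in each round every node merges views with one random peer, and within a few rounds every node has a complete picture of the cluster.

```python
import random

random.seed(42)  # deterministic pairings so the example is repeatable

def gossip_round(states):
    """One push-pull gossip round: every node picks a random peer, and
    the pair merge their views of per-node heartbeat counters, each
    keeping the highest counter it has seen for every node."""
    for node in list(states):
        peer = random.choice([n for n in states if n != node])
        merged = {n: max(states[node].get(n, 0), states[peer].get(n, 0))
                  for n in states[node].keys() | states[peer].keys()}
        states[node] = dict(merged)
        states[peer] = dict(merged)

# Each node starts off knowing only its own heartbeat.
nodes = ("a", "b", "c", "d", "e")
states = {n: {n: 1} for n in nodes}

rounds = 0
while any(len(view) < len(nodes) for view in states.values()):
    gossip_round(states)
    rounds += 1
# After a handful of rounds, every node holds a complete cluster view.
```

Real gossip protocols layer failure detection, anti-entropy and suspicion timeouts on top of this, but the epidemic spread of information per round is the part that makes them scale.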

Eventually consistent data types and read/write consistencies

Generally in a distributed system, locks are impractical to implement and impossible to scale. As a result, trade-offs need to be made between the consistency and availability of data. In many cases, for example, availability can be prioritised, and consistency guarantees weakened to eventual consistency, with data structures such as CRDTs.

If you’re not familiar with CRDTs or Operational Transformation, or with the concept of tunable consistency levels for reads and writes against a distributed data store, then you’ve got some reading to do:

Deep network protocol understanding

In a distributed system, you’ll almost certainly be working within all layers of the networking stack. We rely extensively on higher-level protocols such as HTTP, WebSockets and gRPC, as well as raw TCP sockets; without a deep understanding of those protocols, and of the full stack they sit on all the way down to the OS itself, you’ll likely struggle to solve problems when things go wrong. Take, for example, a single request or WebSocket connection: it involves all of the layers below, and at each one you should be confident in your understanding and your ability to debug problems at a packet or frame level:

  • DNS (over UDP) for address lookup
  • File descriptors (on *nix) and the buffers used for connections, NAT tables, conntrack tables, etc.
  • IP to route packets between hosts
  • TCP to establish a connection
  • TLS handshakes, termination and certificate authentication
  • HTTP/1.1, or more recently HTTP/2, used extensively by gRPC
  • WebSocket upgrades over HTTP
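You can make a few of those layers visible with nothing but Python’s standard library. The sketch below spins up a throwaway local server purely so the example runs without internet access; it skips TLS and WebSockets, and lets `getaddrinfo` handle the DNS step, but it makes the resolution → TCP connection → HTTP request layering explicit rather than hidden behind an HTTP client.

```python
import socket
import threading

def serve_once(server_sock):
    """Accept one connection, read the raw HTTP request bytes, and send
    back a hand-written HTTP/1.1 response. A stand-in for a real server."""
    conn, _ = server_sock.accept()
    conn.recv(1024)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: the OS picks a free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

# Layer 1: name resolution (getaddrinfo wraps the DNS/hosts lookup).
family, type_, proto, _, addr = socket.getaddrinfo(
    "localhost", port, socket.AF_INET, socket.SOCK_STREAM)[0]

# Layer 2: a TCP connection (the three-way handshake happens here).
client = socket.socket(family, type_, proto)
client.connect(addr)

# Layer 3: HTTP/1.1, which is just structured text over that TCP stream.
client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
chunks = []
while True:
    data = client.recv(1024)
    if not data:                     # peer closed the connection
        break
    chunks.append(data)
client.close()
response = b"".join(chunks).decode()
```

In production, TLS would wrap the TCP stream before any HTTP bytes flow, and a WebSocket would begin life as an HTTP upgrade request on this same stack, which is why debugging often means peeling these layers apart one at a time.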

And that’s not all…

It’s pretty clear why we’re finding it hard to find a distributed systems engineer. Conceptually there’s a lot to learn before you even get started on the fundamentals such as programming languages, general design patterns, version control, infrastructure, and continuous integration and deployment systems.

Further reading

See an introduction to distributed systems by Kyle Kingsbury.

Interested in solving distributed problems in a distributed team?

Apply now for the position. We’re actively looking for remote (in Europe) or onsite (in London) team members to join our distributed engineering team.

Know someone interested in solving distributed problems and want to earn $3k for a minute of your time?

Given that it appears to be easier to find rocking horse shit than a good distributed systems engineer, we’re trying all avenues! If you refer an engineer we go on to employ, we’ll send you $3k as a thanks. One email = $3k.

Not a distributed engineer but interested in working at Ably?

See our jobs board to see if we have any positions suitable.

Matthew O’Riordan, CEO, Ably