In my last post I lamented the pain felt applying object orientation to distributed architectures but I didn’t talk about why. To understand this, it helps to look back to Peter Deutch’s fallacies of distributed computing (If you want to learn more about these fallacies go here)

One of the first systems I developed was a wireless restaurant point of sale system. A key feature was waitstaff could send customer orders wirelessly with handheld devices to a backend server. This sounds like a simple enough system to build. And it is at development time with a device emulator running on the same machine as the application server. With this setup, it’s easy to fall into a trap. We code oblivious to Peter’s first fallacy, that the network is reliable, because we abstracted the network away somewhere. All is fine until deployment, because that is when Microwave ovens disrupt 802.11B devices, staff trip over cords and send PCs or switches offline and phone systems go in that compete with your wireless signal. That is when impenetrable cinderblock walls and the electric surges of heavy machinery wreak havoc on your network.

RPC (remote procedure call) technologies make the situation worse. Network is down? You get a timeout error. Want to retry? Implement it yourself.

I did that with the restaurant system. I devised retry logic so that waitstaff could move back into the wireless range and try sending the order again if it failed the first time. But this introduces another problem with RPC over unreliable networks. With RPC there are two distinct outcomes for a timeout. First, the RPC call may have never reached the server. In the case of our restaurant system, we would have hungry, angry customers unless the system retries to send the order. But what about the second outcome? What if the order did reach the server and the timeout occurred on the way back? If this is the case, and the waitstaff retries, the order is received twice. This means we requested double food (unless our services are idempotent)!

So is RPC over networks always bad? Actually many successful, distributed systems have been built on RPC. To say “the network is unreliable” is too binary. Reliability is a range with my scary restaurant network on one side and mission critical no single point of failure networks on the other. The abstractions afforded by RPC are acceptable for many non mission critical systems. Yet if there is another pattern, another abstraction that’s not much more complicated and does bring more reliability, shouldn’t we look to it?

When it has to be as reliable as possible, there are two choices. First, you can implement socket level communication. This means doing the connects, retries, and reconnects yourself. I’ve done that when implementing clients for mainframe messaging. It’s not terribly fun but it is effective. Second, you can use a reliable messaging pattern to smooth out the inherently unreliable network. In a future post I’ll talk about what patterns are applicable.

In the end, there is no 100% reliability once you introduce a network. This is clearly illustrated by the Two Generals Problem. But that doesn’t mean we should throw up our hands. It means we need to do what we’ve always done as IT professionals – make good tradeoffs and think deeply about when and where we need specific nonfunctional requirements such as levels of reliabiity.