Envoy Service Discovery with Hein Oldewage • Surfacing.RR

Phlippie Bosman

21 October 2019

Engineering

This is the first post in a new series called Surfacing.RR. In this series, we will uncover nuggets of knowledge buried deep within the projects and teams at Retro Rabbit, and bring them to the surface.

In this first instalment of the Surfacing.RR series, we talk to Hein Oldewage about Envoy Proxy’s service discovery feature. Hein currently writes backend code for one of our mobile app projects, Kalido. One of Kalido’s many interesting implementation details is that backend/frontend communication is done using gRPC. The services that are available on the backend are defined as individual gRPC services, which are essentially microservices. This means that the Kalido server team needed a proxy to route frontend requests to those services. They settled on the very cool Envoy Proxy. After using and loving it for a while, the team decided to wire up the proxy’s service discovery feature.

I asked Hein about Envoy Proxy, service discovery, and why it’s worth knowing about.

Phlippie: Let’s just talk about Envoy first. How does it work?

Hein: With Envoy, there are four important concepts to grasp: listeners, clusters, routings, and endpoints. Those are the components needed to connect client requests to server services.

Listeners are the ports on which clients connect. They serve as the entry-point for a client connection.

A cluster is a collection of similar servers or endpoints. Clusters are defined by us, and we can define them however we want. The obvious way to define our clusters is to group similar services — for example, all chat-related services would live in one cluster. Interestingly, we can define a cluster “wrong”, in a way that it wouldn’t work, by including endpoints that shouldn’t be there — such as profile-related services on a chat cluster. Those services would then be unreachable to the client.

Routings connect listeners to clusters. A routing is defined as a condition that a request should meet, and the cluster that the request should be sent to if it meets that condition.

And lastly, an endpoint is a port on a server that receives and handles a request.
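To make the four concepts concrete, here is a minimal Python sketch of how they fit together. It is a toy model, not Envoy's configuration format or API: the class names, the prefix-matching rule, the service paths and the addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


@dataclass
class Cluster:
    name: str
    endpoints: List[Endpoint] = field(default_factory=list)


@dataclass
class Route:
    path_prefix: str  # the condition a request must meet (hypothetical: a path prefix)
    cluster: str      # the cluster that matching requests are sent to


@dataclass
class Listener:
    port: int
    routes: List[Route] = field(default_factory=list)


def resolve(listener: Listener, clusters: Dict[str, Cluster], path: str) -> Optional[Endpoint]:
    """Return the endpoint a request would reach, or None if it is unreachable."""
    for route in listener.routes:
        if path.startswith(route.path_prefix):
            endpoints = clusters[route.cluster].endpoints
            # A real proxy would load-balance; this sketch just picks the first endpoint.
            return endpoints[0] if endpoints else None
    return None


# Hypothetical setup: the profile endpoint was put on the chat cluster by mistake.
clusters = {
    "chat": Cluster("chat", [Endpoint("10.0.0.1", 50051), Endpoint("10.0.0.2", 50052)]),
    "profile": Cluster("profile", []),
}
listener = Listener(
    port=443,
    routes=[
        Route("/chat.ChatService/", "chat"),
        Route("/profile.ProfileService/", "profile"),
    ],
)

chat_target = resolve(listener, clusters, "/chat.ChatService/Send")
profile_target = resolve(listener, clusters, "/profile.ProfileService/Get")
```

In this setup `chat_target` is the first chat endpoint, while `profile_target` is `None`: the profile service sits on the wrong cluster, so requests for it have nowhere to go, which is the "wrong" cluster definition described above.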

So with those building blocks, an Envoy proxy can be configured to send client requests to their matching services. Before we set up service discovery, this was a bit of a pain point. This configuration had to be saved in a config file. Our config file was over 1,000 lines long, and we constantly had to keep it up to date to ensure that our services were reachable. Any time we scaled our services up or down, we had to edit the file by hand. It was very labour-intensive. Moreover, whenever we deploy, each server temporarily goes down. During that time, we really shouldn’t route any traffic to that server. But temporarily changing our config to point requests to other servers for each deploy is just not feasible.

Phlippie: So this is where service discovery comes in?

Hein: Yes. Service discovery replaces the config file with, well, automatic service discovery. Envoy obtains its config from a separate discovery service instead. No need to edit that huge config file anymore!

Service discovery also has the nice property that it produces a configuration that is eventually consistent.

Phlippie: How did you go about setting it up?

Hein: In our first implementation, we had almost everything static — the listeners, clusters and routes were still hard-wired. We just made it so the endpoints within the clusters are dynamically discovered. That setup was good enough to address our problems at the time. Now we could scale the services up or down within the clusters, and service discovery would make sure all the traffic ended up at the services.
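The first implementation Hein describes, with static clusters and dynamically discovered endpoints, can be sketched roughly as follows. Envoy does this through its own discovery APIs, which are not used here. `EndpointRegistry`, its methods and the addresses are hypothetical names for a toy in-memory stand-in.

```python
from typing import Dict, Iterable, List, Set


class EndpointRegistry:
    """Toy discovery service: the cluster names are fixed, the endpoints come and go."""

    def __init__(self, cluster_names: Iterable[str]) -> None:
        # The set of clusters is static, as in the first implementation.
        self._endpoints: Dict[str, Set[str]] = {name: set() for name in cluster_names}
        self.version = 0  # bumped on every change so a proxy can tell its view is stale

    def register(self, cluster: str, address: str) -> None:
        """Called when a service instance comes up."""
        if cluster not in self._endpoints:
            raise KeyError(f"unknown cluster: {cluster}")
        self._endpoints[cluster].add(address)
        self.version += 1

    def deregister(self, cluster: str, address: str) -> None:
        """Called when an instance goes down, for example during a deploy."""
        self._endpoints[cluster].discard(address)
        self.version += 1

    def snapshot(self, cluster: str) -> List[str]:
        """The endpoints a proxy should currently route to for this cluster."""
        return sorted(self._endpoints[cluster])


registry = EndpointRegistry(["chat", "profile"])

# Scale the chat service up to two instances.
registry.register("chat", "10.0.0.1:50051")
registry.register("chat", "10.0.0.2:50051")
scaled_up = registry.snapshot("chat")

# One instance goes down for a deploy, so traffic should stop going to it.
registry.deregister("chat", "10.0.0.1:50051")
during_deploy = registry.snapshot("chat")
```

Nobody edits a config file here: each instance announces itself when it comes up and is removed when it goes down, and the proxy routes only to whatever the latest snapshot contains.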

But then we started implementing a backend-for-frontend system. We needed a new configuration with a new route to a new executable on a new machine. So we did a second implementation of service discovery where only the listeners are static. Now we can add whole new service binaries with minimal reconfiguration.

And we’re already thinking about what a third implementation might look like. In our backend-for-frontend system, we have a new core layer that is only meant to be accessed by the BFF layers, which should be hidden from the wider outside world. Envoy allows us to set up listeners that are only exposed internally to enforce this encapsulation logic. If we do that, we would probably make those listeners dynamically configured as well, using service discovery.

Phlippie: Aside from the reduced paperwork, were there any unexpected advantages to using service discovery?

Hein: So Envoy was originally built by an infrastructure team at Lyft. This team’s focus was just to make sure that their microservices work, and that they’re simple to set up. We are now also benefiting from that focus with our new backend-for-frontend system; when a frontend team needs a new server on our side, their server only has to declare itself. Service discovery then does all the heavy lifting of making sure their client calls end up where they should. Service discovery basically automates the DevOps for us.

We also have plans for really cool CI that we could do using service discovery. We could reduce our deployment cycle dramatically. Instead of blocking on code reviews and doing maybe one deploy per day, we could literally deploy every commit. Like, we would have automated requirements based on test coverage, for example, but every time someone commits code, we could deploy it. But here’s where it gets cool. We could deploy the new commit alongside the old one. Then Envoy would monitor the error rates resulting from that deploy. If it finds that we start getting higher error rates, it would trigger an alert and roll back. It could even detect whether other modules start experiencing more errors, and only roll back the newly deployed module. It’s testing in production, but fully automated.

We could also do load testing in this way. The newly deployed version would start out taking, say, only 10% of traffic. If the error rate is acceptable, Envoy starts scaling up the load. Some issues only appear once a service takes on a lot of traffic; if the new commit is susceptible to something like that, Envoy would tell us when we hit a critical load.
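As a rough illustration of that ramp-up idea, the sketch below raises the new version's share of traffic in steps and stops at the first step where the error rate becomes unacceptable. It is not an Envoy feature call. The function name, the 10% step, the 1% threshold and the `observe_error_rate` callback are all assumptions.

```python
from typing import Callable, Optional, Tuple


def ramp_traffic(
    observe_error_rate: Callable[[int], float],
    start: int = 10,
    step: int = 10,
    max_error_rate: float = 0.01,
) -> Tuple[int, Optional[int]]:
    """Increase the new version's traffic share (in percent) step by step.

    Returns (last healthy share, share at which errors exceeded the limit).
    The second value is None if the version stayed healthy up to 100%.
    """
    healthy = 0
    share = start
    while share <= 100:
        # observe_error_rate is a hypothetical hook: it would route `share`
        # percent of traffic to the new version and report its error rate.
        if observe_error_rate(share) > max_error_rate:
            return healthy, share
        healthy = share
        share += step
    return healthy, None


def fragile_version(share: int) -> float:
    """Fake service that starts failing once it takes more than 60% of traffic."""
    return 0.0 if share <= 60 else 0.2


def solid_version(share: int) -> float:
    """Fake service that never produces errors."""
    return 0.0


fragile_result = ramp_traffic(fragile_version)
solid_result = ramp_traffic(solid_version)
```

With these fake services, `fragile_result` is `(60, 70)`, meaning the version was fine up to 60% and hit its critical load at 70%. `solid_result` is `(100, None)`, meaning it reached full traffic without problems.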

That would be great for catching the kind of bug we recently experienced, as an example. We deployed a version with a bug where, every time a client hung up (which happens on every call), we treated it as an error. Our error rates shot through the roof. Envoy could have detected that this release was giving us a 100% error rate on all calls, alerted us, and rolled back until we fixed it.
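The rollback decision itself could be as simple as comparing the new version's error rate against the old one's. The sketch below is only an assumption about what such a check might look like. `CallStats`, `should_roll_back`, the 5% tolerance and the call counts are invented for illustration and are not taken from Envoy or from the Kalido setup.

```python
from dataclasses import dataclass


@dataclass
class CallStats:
    total: int   # number of calls handled
    errors: int  # number of those calls that ended in an error

    @property
    def error_rate(self) -> float:
        # With no calls there is nothing to judge, so treat the rate as zero.
        return self.errors / self.total if self.total else 0.0


def should_roll_back(old: CallStats, new: CallStats, tolerance: float = 0.05) -> bool:
    """Roll back when the new version errors noticeably more often than the old one."""
    return new.error_rate > old.error_rate + tolerance


# Hypothetical numbers for the bug described above: every hang-up counted as an error.
old_version = CallStats(total=1000, errors=10)
buggy_version = CallStats(total=200, errors=200)
healthy_version = CallStats(total=200, errors=3)

roll_back_buggy = should_roll_back(old_version, buggy_version)
roll_back_healthy = should_roll_back(old_version, healthy_version)
```

Here `roll_back_buggy` is `True`, since a 100% error rate is far above the old version's 1%. `roll_back_healthy` is `False`, so a release that behaves like its predecessor is left in place.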

So despite deploying each commit, which sounds risky, this would actually be a very safe setup where the new code is tested before it is fully unleashed upon the world, and any issues are automatically handled. And for any issues that crop up during a deploy, we would have very actionable data on what went wrong, such as which errors started appearing in which part of the system, and at which load level.

We’re not quite ready to set this up though. One big hurdle that we still need to figure out is how to handle database migrations. We can roll back code if things go wrong, but we shouldn’t roll back migrations that would cause users to lose their data. And we’re not sure how to handle different versions of the server running at the same time, where those versions expect different database schemas. If we figure that out, maybe we should do a follow-up blog post!

About the author

Phlippie Bosman

Retro Rabbit's resident iOS hipster with a passion for clean and scalable coding. For fun, I like to taste fine whisky, win pub quizzes, and make noisy garage punk with my friends.