Managing Unplanned Traffic Surges – How Edgegap’s Orchestrator Handles Scaling for Your Game
You have added Edgegap game servers and orchestration to your game (congrats!) and now you are ready to take the next step – launching your multiplayer game server hosting in the real world.
This can feel daunting. What if it doesn’t scale, as has happened to so many major projects?
This guide is meant to help you prepare for your game launch by highlighting how Edgegap handles a surge in traffic to ensure your game scales in sync with player demand.
This article should be used in tandem with our “Multiplayer Game Pre-Launch Checklist”.
How does Edgegap manage unplanned spikes in load on its platform?
First and foremost, we handle each deployment as a distinct entity on our platform. Every deployment is a pod within our infrastructure. Requests originate from various sources, including the game clients themselves, the matchmaker, or any other means through which users interact with our API.
Once a request reaches our system, we immediately initiate telemetry and geolocation for the player and assign tasks to our workers to provision a server on our infrastructure. Within a second or two, the container is up and running, ready for incoming players.
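As a rough illustration (not production code), here is what such a deployment request from a matchmaker could look like in Python. The endpoint and payload reflect our public deployments API as described in our documentation, but treat the token handling and the player IPs below as placeholders and confirm exact fields in the API reference:

```python
import requests

EDGEGAP_API = "https://api.edgegap.com"  # base URL of the public API
API_TOKEN = "token <your-api-token>"     # placeholder; use your real token

def request_deployment(app_name: str, version_name: str, player_ips: list[str]) -> dict:
    """Ask the platform for one game server (deployment) for a group of players.

    The player IPs are what the orchestrator uses for geolocation, so it can pick
    the access point with the shortest networking path to those players.
    """
    response = requests.post(
        f"{EDGEGAP_API}/v1/deploy",
        headers={"authorization": API_TOKEN},
        json={
            "app_name": app_name,          # the application you created on the platform
            "version_name": version_name,  # the app version (container image + requirements)
            "ip_list": player_ips,         # public IPs of the players joining this match
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # includes the deployment's request id, among other fields

# Example: a matchmaker requesting one server for two players (IPs are made up).
# deployment = request_deployment("my-game", "v1.0.0", ["203.0.113.10", "198.51.100.7"])
```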
How do we ensure all the necessary infrastructure is in place to serve these deployments promptly?
We’ve developed an abstraction layer overseeing 17 different providers, spanning cloud, bare metal, edge, and Container-as-a-Service providers. They are located worldwide, giving us access to more than 600 data centers.
Our list of providers mixes those with worldwide coverage, such as Amazon, Google, and Azure, with localized providers specific to a country or region. This complementary approach ensures the best coverage, so your game can tap into the world’s largest distributed edge network.
Within our platform, these entities are known as Access Points, each carrying essential information such as capacity, current workload, available resources, and other relevant metrics for decision-making. Thanks to this abstraction layer, you do not need to worry about where your players will come from or which provider you need to work with.
This allows you to focus on the game, while Edgegap focuses on orchestrating and managing the underlying infrastructure.
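Purely as an illustration (these field names are hypothetical, not our internal schema), you can picture an access point record as carrying the data the orchestrator needs for a placement decision:

```python
from dataclasses import dataclass

@dataclass
class AccessPoint:
    """Hypothetical view of what the orchestrator tracks per access point."""
    provider: str        # cloud, bare-metal, edge, or Container-as-a-Service provider
    location: str        # data center / city where the access point lives
    total_cpu: int       # total vCPUs available at this access point
    used_cpu: int        # vCPUs currently consumed by running deployments
    total_memory_mb: int
    used_memory_mb: int

    def can_host(self, cpu: int, memory_mb: int) -> bool:
        """Whether this access point has headroom for one more deployment."""
        return (self.total_cpu - self.used_cpu >= cpu
                and self.total_memory_mb - self.used_memory_mb >= memory_mb)
```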
What are Edgegap’s platform priorities, and in which order?
Edgegap’s platform will prioritize the following elements, in this order:
Provide you with a functioning game server, based on the requirements you specify in your application profile
Get the game server as close as possible to the players, with the shortest networking path, to provide the best player experience, based on a variety of telemetry measurements.
Get you a functioning game server as quickly as possible, with an objective of 1-2 seconds.
When we get a surge (and we mean a HUGE amount of traffic in a very SHORT amount of time; e.g., 40+ deployments per second, which for a 32-player game means roughly 1,300 players trying to get into a match every second, or about 4.6 million players in 60 minutes), the list above becomes harder to deliver for the first few minutes of the surge. In those rare cases, the item at the bottom of the list will be the first one to be relaxed.
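Spelling out the arithmetic behind those example numbers:

```python
deployments_per_second = 40    # sustained surge rate used in the example above
players_per_match = 32         # lobby size used in the example above

players_per_second = deployments_per_second * players_per_match  # 1,280 ≈ 1,300
players_per_hour = players_per_second * 60 * 60                  # 4,608,000 ≈ 4.6 million
```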
How long does it typically take for the platform to stabilize in case of an unplanned surge of traffic?
The platform will scale up, in many ways, based on the traffic. Should the number of requests exceed the current capacity, it can take up to 10-15 minutes for the platform to stabilize and once again deliver the 3 priorities above for each request.
What is Edgegap’s approach when faced with a sudden, unplanned surge in traffic?
Our Capacity Manager (nicknamed “Capman”) is primed to react to incoming traffic, interfacing with the workers requesting resources.
When it detects an increase in demand, it automatically scales up the specific regions where the resources are needed by deploying larger machines. If a geographical area requiring additional support is detected, it swiftly provisions more machines there to optimize latency and enhance the gaming experience for our clients; all of this is done automatically.
When our platform sees a spike of traffic, it scales up the backend to support it. For larger spikes where a lot of CPU is required (e.g. a game server which requires 2 CPUs per instance), it takes between 2 and 5 minutes for the platform to scale and stabilize. After that, the system scales up much faster since it requests larger servers. It adjusts the size of the new servers it requests from each provider based on the influx of traffic, asking for small or large servers depending on whether the influx accelerates or slows down.
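We won’t share Capman’s internals here, but as a loose, hypothetical sketch of the kind of decision it automates (the thresholds, machine sizes, and function names below are invented for illustration):

```python
# Hypothetical sketch of a capacity-manager decision; names, thresholds,
# and machine sizes are invented for illustration and are not Edgegap internals.

SMALL_MACHINE_CPU = 8
LARGE_MACHINE_CPU = 64

def provision_machines(region: str, machine_cpu: int, count: int) -> None:
    """Stand-in for the provider API calls that would actually create machines."""
    print(f"[{region}] requesting {count} machine(s) of {machine_cpu} vCPUs")

def scale_region(region: str,
                 pending_deployments: int,
                 free_cpu: int,
                 influx_accelerating: bool,
                 cpu_per_deployment: int = 2) -> None:
    """Provision more capacity in a region when pending demand exceeds free resources."""
    needed_cpu = pending_deployments * cpu_per_deployment
    shortfall = needed_cpu - free_cpu
    if shortfall <= 0:
        return  # enough headroom already
    # Request bigger machines while the influx keeps accelerating, smaller ones otherwise.
    machine_cpu = LARGE_MACHINE_CPU if influx_accelerating else SMALL_MACHINE_CPU
    count = -(-shortfall // machine_cpu)  # ceiling division
    provision_machines(region, machine_cpu, count)

# e.g. 500 pending 2-vCPU deployments with only 200 free vCPUs during an accelerating surge:
# scale_region("montreal", pending_deployments=500, free_cpu=200, influx_accelerating=True)
```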
TL;DR: in case of an unplanned traffic spike (again, we’re not talking about standard daily ups and downs, but hundreds of thousands of players), it will take a few minutes for the system to stabilize.
Additional notes: this does not apply to events we know about in advance (launches, patches, etc.) or to normal scaling up and down (typical 24-hour traffic). It mostly concerns unplanned spikes; e.g. a streamer starts playing your game and a ton of new players join the fun.
Why does this only happen for UNPLANNED surges, and not for standard traffic scaling over 24 hours?
To manage the daily ebb and flow of traffic, our platform is designed to dynamically scale towards the west of the currently active region, following peak hours as they move across time zones.
By leveraging predictive analytics and monitoring traffic patterns, we anticipate when and where traffic spikes are likely to occur. As such, our infrastructure proactively adjusts its capacity in anticipation of these peak periods.
This proactive scaling ensures that the necessary resources are readily available precisely when and where they are needed, optimizing the user experience and maintaining seamless gameplay, even during periods of high demand.
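As a rough illustration of the idea (not our actual model), a pre-scaling heuristic could look at each region’s historical demand for the coming hour and warm up capacity ahead of it; the numbers below are made up:

```python
# Rough illustration of predictive pre-scaling; the data and headroom factor are invented.

# Historical average deployments per hour, per region (hypothetical numbers).
HOURLY_DEMAND = {
    "montreal": {18: 500, 19: 900, 20: 1200},
    "frankfurt": {18: 800, 19: 1400, 20: 1600},
}

def prewarm_targets(current_hour: int, headroom: float = 1.2) -> dict[str, int]:
    """Return how many deployments' worth of capacity to have ready next hour, per region."""
    next_hour = (current_hour + 1) % 24
    return {
        region: int(demand.get(next_hour, 0) * headroom)  # provision 20% above the expected peak
        for region, demand in HOURLY_DEMAND.items()
    }

# e.g. prewarm_targets(18) -> {"montreal": 1080, "frankfurt": 1680}
```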
What is the behavior when Edgegap is low on access points and newly allocated access points are from a region far away from a player?
When a deployment is requested, the platform makes the best decision based on the resources available. If the platform receives an unplanned large number of requests for a specific region, resources can become scarce in that region for a few minutes (while the system scales up).
In such cases, and only for a few minutes, deployments may be placed slightly further away than they normally would be.
The objective is to always have capacity, regardless of the situation. Those worst-case scenarios happen rarely and only last for a few minutes, as the system will scale up and start learning the new behavior.
What can I do to get the best performance from Edgegap’s platform?
Here are a few things you can do to get the best out of our platform ahead of a planned surge in traffic.
Here is the checklist:
Communicate!
If you know of something happening in your game, e.g. a launch, patches, tournaments, or streamers, let the team know about it.
The more they know, the faster they can react. We have the capability of pre-heating the backend environments to get a large number of servers ready for launch time.
This is only recommended for large planned events, as the platform learns your traffic pattern over time (i.e. the low and high traffic tides over 24 hours) and scales up and down accordingly.
Adjust your timeouts!
To avoid timeout errors, you can increase the “max_time_to_deploy” value of your App Version, either through the dashboard or our API.
The timeout is the amount of time our system will wait before flagging a game server (a deployment) as in error. If your game goes from 0 to 1 million CCU in 10 seconds, deployment times may increase slightly for a few seconds.
Setting this timeout slightly higher mitigates the case where deployments take a little longer because of a traffic surge and get flagged as errors by our platform.
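For example, bumping the timeout via the API could look like the sketch below; the route follows our app version API, but confirm the exact path, field name, and unit in the API reference before using it (the 120 below is only an example value):

```python
import requests

EDGEGAP_API = "https://api.edgegap.com"
API_TOKEN = "token <your-api-token>"  # placeholder; use your real token

def raise_deploy_timeout(app_name: str, version_name: str, max_time_to_deploy: int = 120) -> None:
    """Increase the deployment timeout on an app version so deployments slowed by a
    surge are not flagged as errors prematurely. Check the expected unit in the docs."""
    response = requests.patch(
        f"{EDGEGAP_API}/v1/app/{app_name}/version/{version_name}",
        headers={"authorization": API_TOKEN},
        json={"max_time_to_deploy": max_time_to_deploy},  # field named in this article; verify in the API reference
        timeout=10,
    )
    response.raise_for_status()

# raise_deploy_timeout("my-game", "v1.0.0", max_time_to_deploy=120)
```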
We recommend setting up your system (i.e. lobby, matchmaker) to retry after a few seconds should this happen.
While the platform will always do everything possible to get you a game server through a simple API call, a very (and we mean VERY!) large influx of traffic in a very short amount of time may force our system to return an error message.
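A minimal sketch of that retry logic in a lobby or matchmaker service could look like this, reusing the request_deployment() sketch from earlier in this article (the attempt count and delays are arbitrary starting points):

```python
import time
import requests

def deploy_with_retry(deploy_fn, max_attempts: int = 4, base_delay_s: float = 3.0):
    """Call a deployment function, retrying a few seconds later if the platform
    returns an error during an extreme surge."""
    for attempt in range(1, max_attempts + 1):
        try:
            return deploy_fn()
        except requests.HTTPError:
            if attempt == max_attempts:
                raise  # give up and surface the error to your lobby/matchmaker
            time.sleep(base_delay_s * attempt)  # back off a little more on each retry

# Example, using the request_deployment() sketch from earlier in this article:
# deployment = deploy_with_retry(lambda: request_deployment("my-game", "v1.0.0", ["203.0.113.10"]))
```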
While talking about a disaster recovery plan may seem counterintuitive in a document about performance, we feel that covering every single aspect of every single scenario is at the heart of what we do and how we operate.
Our system is based on a microservices architecture, with highly resilient components.
Those components are vendor-agnostic, can run on any provider, and can easily be deployed or redeployed to add capacity, or to reinstall a new platform in case of catastrophic failure.
On top of that, our architecture is built in a way where non-critical components are not required for the main service (i.e. orchestrating and hosting game servers), which remains the utmost priority.
Prepare with a Load Test!
There’s no better way to assess the behavior of your game’s infrastructure, from Edgegap to game services, than with a load test.
We strongly recommend running one ahead of your launch; it is part of our pre-launch checklist. Check that article for more details.
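As a very simple starting point (not a full load test), the sketch below fires a burst of concurrent deployment requests, reusing the request_deployment() function from earlier, and reports rough timing percentiles. It only exercises the deployment path; a real load test should also cover your matchmaker, lobby, and game clients:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def one_deployment(_: int) -> float:
    """Request a single deployment and return how long it took, in seconds."""
    start = time.monotonic()
    request_deployment("my-game", "v1.0.0", ["203.0.113.10"])  # sketch from earlier in this article
    return time.monotonic() - start

def burst(count: int = 100, concurrency: int = 20) -> None:
    """Fire `count` deployment requests with `concurrency` workers and print rough percentiles."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = sorted(pool.map(one_deployment, range(count)))
    print(f"p50={durations[len(durations) // 2]:.2f}s  p95={durations[int(len(durations) * 0.95)]:.2f}s")

# burst(count=100, concurrency=20)  # remember to terminate the deployments afterwards!
```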
How do we maintain competitive pricing?
As the traffic stabilizes, our automated systems will promptly take action to scale down our infrastructure while ensuring we maintain the necessary capacity to serve our players efficiently and economically. This involves vigilant monitoring of traffic patterns and usage trends, leveraging sophisticated algorithms and predictive analytics to anticipate demand fluctuations and adjust our infrastructure dynamically.
Furthermore, our cost optimization strategies encompass more than just downsizing infrastructure. We continually evaluate and optimize our utilization of cloud services, exploring options such as reserved instances and other cost-saving measures. This proactive stance enables us to uphold a lean and efficient infrastructure, passing on the cost benefits to our customers without compromising on performance or reliability.
Our commitment to efficiency and innovation allows us to deliver a cost-effective solution that prioritizes quality and scalability, ensuring our platform remains responsive and resilient even during periods of fluctuating traffic volume.
Conclusion
There is a lot Edgegap does to make sure that whatever traffic volume your game gets, it is able to scale rapidly to meet player demand at the highest level.
Still, there is a lot you can do to plan ahead and prevent errors. We hope this guide helps you improve your backend and, ultimately, gives you the ultimate reward – peace of mind that whatever happens, your multiplayer game will scale without issues and your players will have a fun, smooth experience.
Written by
the Edgegap Team