Devlog 4 - Music/Game synchronization

Published: June 11th, 2021

Today I'm going to be taking an initial look at Rhythm Runner's music synchronization code, both as an informative technical log and as a way to refresh my memory of the system that I've ended up building. There's a LOT of under-the-hood logic involved in keeping everything working properly, so there will probably be some more follow-up posts about music sync. For now, let's try to cover the basics (synchronizing gameplay to a backing music track).

Background

The art of synchronizing music to gameplay is a bit of an arcane one, with numerous compounding factors and difficulties interfering with what seems like a simple concept. As much as you'd like to just "play the music and start the game at the same time", audio buffers, DSP timing, input latency, display lag, timing drift, and other things need to be considered in order to have a robust approach. The exact nature of these problems and their corresponding solutions are also engine-specific most of the time.

Regardless, here is a brief (and non-exhaustive) list of the things that can go wrong when using a naive latency-unaware approach to music/gameplay synchronization:

Audio scheduling - Telling an audio clip to play doesn't send audio data to the hardware instantaneously. The audio needs to be loaded, scheduled, mixed, etc.
Audio buffering - Audio is mixed using circular buffers, and the exact size of these buffers adds some amount of latency to audio output (see the sketch just after this list for a way to estimate this in Unity).
Frame-based timing - Input/audio/timing is often handled on a per-frame basis, even if the actual events happened between frames.
Input latency - There may be some amount of input latency, either hardware or (more likely) from the game engine, particularly for touch-based inputs.
Visual/display latency - There's some amount of delay before a visual update can actually be seen on the screen (vsync, monitor lag, etc.)
Player expectations - Players can "expect" a certain amount of latency due to previous experience with other games.
Etc...
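
To make that buffering point a little more concrete: Unity can at least tell you how big its DSP buffers are, which gives you a rough lower bound on the mixer's contribution to output latency. Here's a minimal sketch of that query (the function name is just for illustration, and the OS and hardware will add more latency on top of this):

void LogEstimatedBufferLatency() {
    // Ask Unity for the DSP buffer configuration and the output sample rate.
    AudioSettings.GetDSPBufferSize(out int bufferLength, out int numBuffers);
    int sampleRate = AudioSettings.outputSampleRate;

    // e.g. 1024 samples * 2 buffers at 48000 Hz is roughly 42.7 ms of buffering.
    double bufferSeconds = (double)(bufferLength * numBuffers) / sampleRate;
    Debug.LogFormat("Estimated DSP buffer latency: {0:F1} ms", bufferSeconds * 1000.0);
}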

I'd highly recommend studying the notes of Exceed7 Experiments (https://exceed7.com/native-audio/rhythm-game-crash-course/index.html) on this subject if you're unfamiliar with thinking about these sorts of concepts.

A Naive Approach

An initial stab at creating music/gameplay synchronization might be to do something like this:

void Update() {
    // Drive the player's x position directly from the clip's reported playback time.
    // (Transform.position can't be mutated in place, so copy-modify-assign.)
    Vector3 pos = player.transform.position;
    pos.x = audioSource.time * SOME_SPEED_CONSTANT;
    player.transform.position = pos;
}

This works great for a quick-and-dirty prototype (it's essentially what I did for the initial version of Ripple Runner way back in the day...), but unfortunately, there are a couple of problems here:

First, this isn't latency aware, so all of the problems that are listed above apply. audioSource.time tells you "the playback position of the clip from the point of view of the engine", but this doesn't take into account all of the additional processing steps that happen later down the audio processing chain.

Second, audioSource.time doesn't actually update smoothly between frames. So you may get odd results like audioSource.time reading the same value in two consecutive frames, or changing by twice as much during one frame as the next, which results in stuttery movement. Fundamentally this is due to the audio system running on an entirely different timeline than normal "game time", but also because this value is based on the start of the current audio buffer, not the current playback position.

Using PlayScheduled

Unity exposes an AudioSettings.dspTime value which will return the current time from the audio system's point of view. From here on out I'll refer to this as "DSP Time", as opposed to "Game Time" which is the frame-based timing that you'd get from Unity's Time.unscaledTime value. If you ever get confused, remember that for the most part, doubles are DSP time and floats are game time.

We can use Unity's AudioSource.PlayScheduled function in order to schedule our backing track in advance at a particular Audio DSP time. Given enough of a scheduling window (I use a full second, which should be plenty), this guarantees that the audio track will consistently start playing exactly at that Audio DSP time (though of course, there will still be latency...).

We call this schedule point _audioDspStartTime. This represents the DSP Time at which the music track first starts playing.

void Start() {
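    // kStartScheduleBuffer is the ~1 second scheduling window described above.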
    _audioDspStartTime = AudioSettings.dspTime + kStartScheduleBuffer;
    _musicSource.PlayScheduled(_audioDspStartTime);
}

Mapping from DSP Time to Game Time

Unfortunately, there's no "exact" way to map from a DSP timestamp to a game time value, since the two systems update at different intervals. However, we can get pretty close by using a linear regression. The Twitter thread at https://twitter.com/FreyaHolmer/status/845954862403780609 contains a graph illustration of what this looks like, if that helps.

I have an AudioDspTimeKeeper script that is responsible for managing this linear regression mapping throughout the application's lifetime:

void Update() {
    // Sample both clocks as close together as possible. Use the same game-time
    // clock here that SmoothedDSPTime() maps from below.
    double currentGameTime = Time.unscaledTimeAsDouble;
    double currentDspTime = AudioSettings.dspTime;

    // Update our linear regression model by adding another data point.
    UpdateLinearRegression(currentGameTime, currentDspTime);
}

Here, UpdateLinearRegression() is just a bunch of statistics math that uses the average, variance, and covariance to establish a linear regression model. I'm sure you can find an implementation in your favorite programming language if you search for it. Currently I keep a rolling window of the last 15 data points for this regression.
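
For the curious, here's a rough sketch of what that might look like, using a ring buffer of the last 15 samples and an ordinary least-squares fit. The helper field names here are my own placeholders rather than the actual implementation:

// (Sketch) Rolling window of timestamp pairs used to fit the line.
const int kMaxSamples = 15;
private readonly double[] _gameTimes = new double[kMaxSamples];
private readonly double[] _dspTimes = new double[kMaxSamples];
private int _sampleCount = 0;
private int _nextSampleIndex = 0;

// The same two coefficients that SmoothedDSPTime() uses below.
private double _coeff1 = 1.0; // slope (game time -> DSP time)
private double _coeff2 = 0.0; // intercept

void UpdateLinearRegression(double gameTime, double dspTime) {
    // Overwrite the oldest sample once the window is full.
    _gameTimes[_nextSampleIndex] = gameTime;
    _dspTimes[_nextSampleIndex] = dspTime;
    _nextSampleIndex = (_nextSampleIndex + 1) % kMaxSamples;
    if (_sampleCount < kMaxSamples) {
        _sampleCount++;
    }
    if (_sampleCount < 2) {
        return;
    }

    // Ordinary least squares: slope = covariance(x, y) / variance(x).
    double meanX = 0.0, meanY = 0.0;
    for (int i = 0; i < _sampleCount; i++) {
        meanX += _gameTimes[i];
        meanY += _dspTimes[i];
    }
    meanX /= _sampleCount;
    meanY /= _sampleCount;

    double covariance = 0.0, variance = 0.0;
    for (int i = 0; i < _sampleCount; i++) {
        covariance += (_gameTimes[i] - meanX) * (_dspTimes[i] - meanY);
        variance += (_gameTimes[i] - meanX) * (_gameTimes[i] - meanX);
    }

    if (variance > 0.0) {
        _coeff1 = covariance / variance;
        _coeff2 = meanY - _coeff1 * meanX;
    }
}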

The output of the linear regression model is a pair of coefficients (a slope and an intercept) that define a linear mapping, so we can then expose the following function:

public double SmoothedDSPTime() {
    return Time.unscaledTimeAsDouble * _coeff1 + _coeff2;
}

There's one more detail that needs to be addressed, which is that since our linear regression model is constantly being updated, we might get a little bit of jitter. That's fine, but we should make sure that our SmoothedDSPTime() function is monotonically increasing, otherwise there's a chance that the player might move backwards for a frame, which would probably break a lot of things:

public double SmoothedDSPTime() {
    double result = Time.unscaledTimeAsDouble * _coeff1 + _coeff2;
    if (result > _currentSmoothedDSPTime) {
        _currentSmoothedDSPTime = result;
    }
    return _currentSmoothedDSPTime;
}

Putting it together

We now have an AudioDspTimeKeeper.SmoothedDSPTime() function that will give us the (smoothed) current audio DSP time for the current frame. We can now use this as our timekeeping function, in conjunction with our _audioDspStartTime that we set when we first scheduled the backing music track:

double GetCurrentTimeInSong() {
    return AudioDspTimeKeeper.SmoothedDSPTime() - _audioDspStartTime;
}

And we can simply swap this into our naive approach:

void Update() {
    Vector3 pos = player.transform.position;
    pos.x = (float)(GetCurrentTimeInSong() * SOME_SPEED_CONSTANT);
    player.transform.position = pos;
}

Latency compensation

Adding latency compensation into the mix is actually really easy! We can add it here:

double GetCurrentTimeInSong() {
    return AudioDspTimeKeeper.SmoothedDSPTime() - _audioDspStartTime - _latencyAdjustment;
}

So for example, with a latency adjustment of 0.05 (50 milliseconds), GetCurrentTimeInSong() will return a value 50 milliseconds lower than it would normally, which means that the player's position will be slightly to the left of where it otherwise would be.

Of course, the hard part is determining what _latencyAdjustment should be, as this is extremely device-specific and will need to be determined via player calibration. But that's a subject for another time...

Resynchronization Fallback

In theory and in practice, everything above works just great. ...as long as nothing goes wrong.

However, the system is a little brittle, as it depends on a single reference point for song progress (this is our _audioDspStartTime value). Usually this is fine, but there are a number of things that could cause the audio playback to stutter or otherwise become misaligned with what should actually be playing.

As a safeguard, I compare our smoothed DSP time against the value from AudioSource.time. As mentioned earlier, we should never use this value directly in our gameplay calculations due to jitter, but it still makes for a nice sanity check in case something has gone wrong.

void CheckForDrift() {
    double timeFromDSP = AudioDspTimeKeeper.SmoothedDSPTime() - _audioDspStartTime;
    double timeFromAudioSource = _musicSource.time;

    double drift = timeFromDSP - timeFromAudioSource;

    if (System.Math.Abs(drift) > kTimingDriftMargin) {
        Debug.LogWarningFormat("Music drift of {0} detected, resyncing!", drift);
        _audioDspStartTime += drift;
    }
}

Currently I have kTimingDriftMargin at 50 milliseconds, which doesn't seem to trigger unless something actually did go wrong. Unfortunately, this "resync correction" won't be as good or consistent as the original synchronization, but hopefully this won't be needed very often and is just a failsafe.
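
(Since the drift value above is measured in seconds, the margin is just a small constant in the same units; the exact declaration here is my shorthand for it:)

// 50 ms, expressed in seconds to match the drift computed in CheckForDrift().
const double kTimingDriftMargin = 0.05;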

In the future, I'll probably need to add additional fallbacks here, in case for example the audioSource stops playing entirely for some reason, or never starts even though PlayScheduled was called. This is currently a TODO item for me!
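
As a rough idea of what the detection half of that might look like (hypothetical code, not something that exists in the game yet):

void CheckForStalledPlayback() {
    // Hypothetical detection only: the track should have started by now,
    // but the AudioSource isn't playing at all.
    bool shouldBePlaying = AudioSettings.dspTime > _audioDspStartTime;
    if (shouldBePlaying && !_musicSource.isPlaying) {
        Debug.LogWarning("Music source isn't playing when it should be!");
        // TODO: recover, e.g. reschedule the source and seek it back to
        // the expected position in the song.
    }
}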

Phew! Apologies for the lack of images in this post, but unfortunately animated gifs don't really help when trying to show off audio synchronization and I don't happen to have any fancy diagrams on hand to help explain this stuff. :x Hopefully this all made some amount of sense regardless! If not, again I would highly recommend reading Exceed7's work on this (see https://exceed7.com/native-audio/rhythm-game-crash-course/backing-track.html for the post on backing track synchronization) for a more detailed explanation.

<< Back: Devlog 3 - Flying Mechanic and Level Generation
>> Next: Devlog 5 - Water/Air Jump Prototyping