Introducing NeuroSync: Real-Time 60fps Audio-to-Face Animation with Transformer-Based Seq2Seq Neural Networks

Yes, you can do that before it's sent to the API, or add a converter into the API if you know the sample rate will be the same each time.
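
For anyone wanting to do the conversion client-side, here is a minimal sketch of re-encoding arbitrary audio to WAV in memory before POSTing it. The endpoint URL and payload shape are placeholders, not the project's actual API, so adjust them to match your setup:

```python
import io

import requests
import soundfile as sf

def send_audio(path: str, url: str = "http://localhost:5000/audio_to_blendshapes"):
    # Read any format soundfile supports (FLAC, OGG, MP3 via libsndfile, ...).
    data, sample_rate = sf.read(path)

    # Re-encode as WAV into an in-memory buffer so nothing touches disk.
    buffer = io.BytesIO()
    sf.write(buffer, data, sample_rate, format="WAV")
    buffer.seek(0)

    # POST the raw WAV bytes; the route and request shape are assumptions.
    response = requests.post(url, data=buffer.read())
    response.raise_for_status()
    return response.json()
```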

Currently, it requires WAV because it upsamples everything to 88,200 Hz for the feature extraction regardless of the input rate - this makes variance in the sample rate easier to deal with (you can feed in TTS or mic audio; the higher the sample rate, the better the face shapes from the audio).
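
To illustrate that upsampling step, here is a rough sketch of loading audio at a fixed 88,200 Hz target rate before feature extraction. It uses librosa for brevity; the model's actual preprocessing pipeline may do this differently:

```python
import librosa

TARGET_SR = 88_200  # fixed rate the feature extractor expects (assumption from above)

def load_for_extraction(path: str):
    # librosa resamples on load when sr is given, whatever the source rate
    # (TTS output, mic capture, etc.), so downstream code sees one rate.
    audio, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    return audio, sr
```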

The Python really is a guide to how you might use it, not production code to drop into a project as-is - an application is under development for those not confident implementing their own solution.