If you want to avoid hand animation and lean on lipsync tooling, you'll likely need some way to interpret the data those tools provide. It's been a while since I last looked at lipsync, but the way I've seen it done in a few places is to extract visemes from the audio and drive blends based on them.
I can't speak to the audio-to-facial-animation system specifically, but Microsoft's system appears to work this way.
Either way, to make use of automated lipsync you'll likely need blendshapes or joint poses on the mesh that express the visemes in some form.
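As a rough illustration of the runtime side, here's a minimal sketch of mapping timestamped viseme events to blendshape weights. Everything here is assumed: the `VisemeEvent` format, the viseme names, the `VISEME_TO_BLENDSHAPES` table, and the blendshape names are placeholders for whatever your lipsync tool and rig actually expose, not any particular vendor's API.

```python
from dataclasses import dataclass

@dataclass
class VisemeEvent:
    time: float   # seconds into the audio clip (hypothetical tool output)
    viseme: str   # e.g. "sil", "AA", "FF", "PP" -- whatever set the tool emits

# Hypothetical mapping from a viseme to blendshape weights on the character mesh.
VISEME_TO_BLENDSHAPES = {
    "sil": {},                                  # silence: mouth at rest
    "AA":  {"jawOpen": 0.7, "mouthOpen": 0.6},
    "FF":  {"lowerLipBite": 0.8},
    "PP":  {"lipsPressed": 1.0},
}

def blend_weights_at(events, t):
    """Return blendshape weights at time t by cross-fading between the
    viseme events that bracket it (events assumed sorted by time)."""
    prev = nxt = None
    for ev in events:
        if ev.time <= t:
            prev = ev
        else:
            nxt = ev
            break
    if prev is None:
        return {}
    if nxt is None:
        return dict(VISEME_TO_BLENDSHAPES.get(prev.viseme, {}))
    # Linear crossfade from the previous viseme's shapes to the next one's.
    alpha = (t - prev.time) / (nxt.time - prev.time)
    out = {}
    for viseme, scale in ((prev.viseme, 1.0 - alpha), (nxt.viseme, alpha)):
        for shape, weight in VISEME_TO_BLENDSHAPES.get(viseme, {}).items():
            out[shape] = out.get(shape, 0.0) + weight * scale
    return out
```

Each frame you'd call something like `blend_weights_at(events, playback_time)` and push the resulting weights onto the mesh's blendshapes, or translate them into joint poses if the rig is bone-driven instead.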