I just want to understand so I can give a correct answer, since contextual interactions are not super easy. I generally tend to do this as procedurally as possible.
You want a first-person to third-person camera blend during an interaction, where:
1-The camera moves slightly outside; you hide the first-person objects and show the third-person ones (or whatever applies, depending on your viewmodels).
2-Prepare the transition: cut controls, handle collisions on objects, and any other gameplay things.
3-The animation plays on the character; during that time we move the capsule to its designated position, or use root motion.
4-Transition back: hide the third-person objects, show the first-person ones, and give controls back.
Leaving aside object bounds, sizing, direction changes, etc., is this what you are trying to do?
Because I cannot understand why there is a cutscene. A cutscene is something different.
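
If that is the flow, the four steps above could be sketched as a tiny state machine. This is only a rough, engine-agnostic illustration in Python, not real engine code: `Interaction`, `Phase`, `step()` and all of the fields are hypothetical names made up for the example. It also assumes each phase finishes instantly, where a real game would wait for the camera blend or the animation to end before moving on.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    FIRST_PERSON = auto()  # normal gameplay, before and after the interaction
    BLEND_OUT = auto()     # step 1: camera outside, viewmodels swapped
    PREPARE = auto()       # step 2: controls cut
    ANIMATE = auto()       # step 3: animation plays, capsule is moved


@dataclass
class Interaction:
    """Hypothetical container for the state the four steps touch."""
    target_position: tuple
    capsule_position: tuple = (0.0, 0.0, 0.0)
    phase: Phase = Phase.FIRST_PERSON
    first_person_visible: bool = True
    third_person_visible: bool = False
    controls_enabled: bool = True

    def step(self) -> Phase:
        """Advance to the next phase and apply that phase's side effects."""
        if self.phase is Phase.FIRST_PERSON:
            # Step 1: hide first-person objects, show third-person ones.
            self.first_person_visible = False
            self.third_person_visible = True
            self.phase = Phase.BLEND_OUT
        elif self.phase is Phase.BLEND_OUT:
            # Step 2: cut controls (collision handling would also go here).
            self.controls_enabled = False
            self.phase = Phase.PREPARE
        elif self.phase is Phase.PREPARE:
            # Step 3: stand-in for root motion or a manual capsule move.
            self.capsule_position = self.target_position
            self.phase = Phase.ANIMATE
        else:
            # Step 4: swap the meshes back and give controls back.
            self.first_person_visible = True
            self.third_person_visible = False
            self.controls_enabled = True
            self.phase = Phase.FIRST_PERSON
        return self.phase


if __name__ == "__main__":
    interaction = Interaction(target_position=(120.0, 0.0, 0.0))
    for _ in range(4):
        print(interaction.step().name)
```

In a real project each `step()` call would be driven by a callback (blend finished, animation finished) instead of a loop, and the object bounds, sizing and direction handling would sit on top of this.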