That’s the key point.
With a Rokoko suit, even assuming you also have the cube thing, you get a mediocre end result at best.
A result which you still need to clean up manually, and that cleanup ends up being a significant time investment (which is exactly what you'd normally assume the money spent on the suit is supposed to prevent).
This is where coding comes in to lessen said cleanup. Instead of your fingers sitting 10cm away from the can, which is what Rokoko would pass into the engine as-is, the engine itself can be made to adjust the data so that far less manual work is needed.
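To give an idea of the kind of adjustment I mean, here's a standalone sketch (not Rokoko's or Unreal's actual API - the can size, the snap threshold and the function itself are all made up for illustration): if the streamed fingertip ends up within range of the can's surface, pull it onto the surface; otherwise trust the raw data.

```cpp
// Rough sketch of the kind of in-engine correction I mean, not Rokoko's API.
// Assumes you already have the fingertip position from the mocap stream and
// know the can's center axis and radius in the same space (both hypothetical).
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 mul(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float len(Vec3 a) { return std::sqrt(a.x*a.x + a.y*a.y + a.z*a.z); }

// If the streamed fingertip hovers within `snapRange` of the can's surface,
// pull it onto the surface; otherwise leave the raw mocap value alone.
Vec3 SnapFingertipToCan(Vec3 fingertip, Vec3 canCenter, float canRadius, float snapRange)
{
    // Work in the horizontal plane of the can (treat it as an upright cylinder).
    Vec3 flat = fingertip; flat.z = canCenter.z;
    Vec3 out = sub(flat, canCenter);
    float dist = len(out);
    if (dist < 1e-4f || std::fabs(dist - canRadius) > snapRange)
        return fingertip; // too far away (or degenerate): trust the raw data

    Vec3 onSurface = add(canCenter, mul(out, canRadius / dist));
    onSurface.z = fingertip.z; // keep the original height
    return onSurface;
}

int main()
{
    // Fingertip floating ~10 cm off a 3.3 cm radius can, as in the example above.
    Vec3 corrected = SnapFingertipToCan({0.13f, 0.f, 1.f}, {0.f, 0.f, 0.9f}, 0.033f, 0.12f);
    std::printf("%.3f %.3f %.3f\n", corrected.x, corrected.y, corrected.z);
}
```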
However - none of that is actually done with PhAT (the physics asset) at all.
The mocap suit just sends bone values (usually via the Live Link plugin), and the engine reads them.
Those values are defined elsewhere.
The “physical” object in real life doesn’t necessarily match or have anything to do with the object in game.
Your example of a can could probably be done just the same by picking up a Vive tracker directly - since fitting a tracker inside an actual can is problematic to say the least.
Also, don't go thinking that the finger tracking of the gloves is at ALL accurate (true for Rokoko as for any other similar product that doesn't directly use markers and 3D cameras).
Even with the "magic" EMF-emitting box Rokoko offers, you get very sub-par results (likely not even worth the expense).
In other words, you get close. Closer than you would if you had to do it all by hand - but you still have to do most of it manually.
So, cleaning up the animations with custom code in-engine that you can leverage to produce a final "take" is really worth the time.
But again, it's not done via physics - 99% of the time it's just done with straight-up math.
Or data transfer if you will.
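As an example of what "straight up math" cleanup can look like, here's a minimal, self-contained sketch (nothing here is Live Link's real API - the bone name, the alpha value and the fake frames are placeholders): a simple low-pass filter over the streamed bone positions to knock the jitter out before it ever reaches the animation.

```cpp
// A minimal example of math-only cleanup on streamed mocap data: a low-pass
// filter over per-bone positions. Bone names and the fake "stream" below are
// stand-ins, not anything from Live Link specifically.
#include <cstdio>
#include <map>
#include <string>

struct Vec3 { float x, y, z; };

class BoneSmoother
{
public:
    explicit BoneSmoother(float alpha) : Alpha(alpha) {}

    // Blend the new sample toward the previous filtered value.
    Vec3 Filter(const std::string& bone, Vec3 sample)
    {
        auto it = Prev.find(bone);
        if (it == Prev.end())
        {
            Prev[bone] = sample;
            return sample;
        }
        Vec3& p = it->second;
        p = { p.x + Alpha * (sample.x - p.x),
              p.y + Alpha * (sample.y - p.y),
              p.z + Alpha * (sample.z - p.z) };
        return p;
    }

private:
    float Alpha;                       // 0..1, lower = smoother but laggier
    std::map<std::string, Vec3> Prev;  // last filtered value per bone
};

int main()
{
    BoneSmoother smoother(0.4f);
    // Fake a noisy wrist stream; in practice this runs on every incoming frame.
    Vec3 frames[] = { {0.50f, 0.f, 1.f}, {0.53f, 0.f, 1.f}, {0.49f, 0.f, 1.f}, {0.52f, 0.f, 1.f} };
    for (Vec3 f : frames)
    {
        Vec3 s = smoother.Filter("hand_r", f);
        std::printf("%.3f %.3f %.3f\n", s.x, s.y, s.z);
    }
}
```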
The video uses cameras and trackers - which is how it manages to be more accurate than the Rokoko/sensor-based BS suits (worst expense I ever made, hands down - I do wonder if a Vicon setup is around 4K all-in or less).
Generally speaking, you are better off paying a mocap studio to record what you need, since they give you the end product complete with all the adjustments for a fraction of what a do-it-yourself system costs.
You can clearly see bits of the video where the gun grips are way off - that’s the nature of passing data without any sort of post-process adjustment.
He could improve on the grip system by just adding a bit of code to do the proper post-processing for him…
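Something along these lines would do it - a toy, standalone sketch with made-up names and thresholds (in UE you'd run it as a pass over the hand bone, e.g. in the anim graph, rather than as a free function): once the tracked hand gets within range of the weapon's grip point, blend it onto the grip instead of trusting the raw offset.

```cpp
// Roughly the grip post-process I have in mind: once the tracked hand is close
// enough to the weapon's grip point, stop trusting the raw mocap offset and
// blend the hand onto the grip over a few frames. All names are invented.
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

static Vec3 lerp(Vec3 a, Vec3 b, float t)
{
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}

static float dist(Vec3 a, Vec3 b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx*dx + dy*dy + dz*dz);
}

struct GripSolver
{
    float SnapRange = 0.15f; // start correcting within 15 cm of the grip
    float Blend = 0.f;       // 0 = raw mocap, 1 = fully pinned to the grip

    Vec3 Solve(Vec3 rawHand, Vec3 gripPoint, float dt)
    {
        // Ramp the blend up when near the grip, back down when the hand leaves.
        float target = (dist(rawHand, gripPoint) < SnapRange) ? 1.f : 0.f;
        float speed = 8.f; // how quickly we commit to / release the grip
        Blend += (target - Blend) * std::fmin(1.f, speed * dt);
        return lerp(rawHand, gripPoint, Blend);
    }
};

int main()
{
    GripSolver solver;
    Vec3 grip = {0.f, 0.f, 1.f};
    // Hand drifting toward the grip over a few 60 fps frames.
    Vec3 hand[] = { {0.30f, 0.f, 1.f}, {0.12f, 0.f, 1.f}, {0.10f, 0.f, 1.f}, {0.09f, 0.f, 1.f} };
    for (Vec3 h : hand)
    {
        Vec3 s = solver.Solve(h, grip, 1.f / 60.f);
        std::printf("%.3f\n", s.x);
    }
}
```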
Your physics would only come into play whenever you “throw” an object and you want the engine to be in charge of the end result/simulation.
Otherwise, everything/anything is just linked to the data that you import from the mocap system (whatever it may be).
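To make that split concrete, here's a toy handoff (not PhAT or UE's actual simulate-physics flow, just the idea): while the prop is held it simply copies the mocap-driven hand transform and velocity; on release, a plain ballistic step takes over and the mocap data stops mattering for that object.

```cpp
// The only spot physics genuinely earns its keep: while the hand holds the
// prop, the prop just copies the mocap-driven hand transform; on release it
// inherits the hand's velocity and the simulation owns it. Toy integrator only.
#include <cstdio>

struct Vec3 { float x, y, z; };

struct ThrownProp
{
    bool Held = true;
    Vec3 Pos{0, 0, 0};
    Vec3 Vel{0, 0, 0};

    // While held: track the mocap hand. Once released: simple ballistic step.
    void Tick(Vec3 handPos, Vec3 handVel, float dt)
    {
        if (Held)
        {
            Pos = handPos;     // prop is just glued to the streamed data
            Vel = handVel;     // remember velocity so the throw looks continuous
        }
        else
        {
            Vel.z -= 9.81f * dt;           // gravity
            Pos = { Pos.x + Vel.x * dt,
                    Pos.y + Vel.y * dt,
                    Pos.z + Vel.z * dt };
        }
    }

    void Release() { Held = false; }
};

int main()
{
    ThrownProp can;
    float dt = 1.f / 60.f;

    // Hand moving forward at 3 m/s while holding the can...
    for (int i = 0; i < 3; ++i)
        can.Tick({0.05f * i, 0.f, 1.2f}, {3.f, 0.f, 1.f}, dt);

    // ...then let go: from here on the mocap data is ignored for the can.
    can.Release();
    for (int i = 0; i < 3; ++i)
    {
        can.Tick({}, {}, dt);
        std::printf("%.3f %.3f %.3f\n", can.Pos.x, can.Pos.y, can.Pos.z);
    }
}
```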