DataValidationCommandlet loading strategy is slow at scale

Hello,

We use the DataValidationCommandlet for pre-submit testing. It has always taken a long time to execute, and its performance has worsened over time. I profiled the commandlet in a “hot” environment, meaning that all asset data is cached and ready. Assets appear to be loaded unconditionally, so a huge amount of time is wasted loading assets that no validator needs to process.

Total Number Of Assets: ~300,000

Total Number Of Validators: ~12

Total Time Spent in DataValidationCommandlet: ~10 minutes

Total Time Spent in ValidateAsset: 8.2 seconds

Total Time Spent in FlushAsyncLoading: 8.6 minutes

Hardware: 9950X, 128 GB DDR5, Fast NVMe drives.

Looking at the code in UEditorValidatorSubsystem::ValidateAssetsInternal, it appears to load every asset it finds before checking whether any validator actually needs that asset to be loaded. This is not a problem for a small number of assets, but as the asset count grows and the assets get more complex, this loading strategy really starts to cause problems.

It seems to me that a potentially more efficient option would be to first use the minimal asset data available in FAssetData to determine whether any validator needs the asset to be fully loaded. Obviously this does not help with validators that return true for all assets, but those validators can simply be moved out of the pre-submit tests and into a nightly validation run that isn’t as time critical.
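
To make the idea concrete, here is a minimal sketch in standalone C++ rather than engine code. `AssetInfo` is a hypothetical stand-in for the metadata in FAssetData, and `Validator::MayNeedAsset` and `SelectAssetsToLoad` are names invented for illustration; none of them exist in the engine.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the lightweight metadata held in FAssetData.
struct AssetInfo {
    std::string PackageName;
    std::string ClassName;
};

// Hypothetical validator: MayNeedAsset inspects metadata only and never loads.
struct Validator {
    std::string Name;
    std::function<bool(const AssetInfo&)> MayNeedAsset;
};

// Returns only the assets that at least one validator wants to see loaded.
std::vector<AssetInfo> SelectAssetsToLoad(const std::vector<AssetInfo>& Assets,
                                          const std::vector<Validator>& Validators) {
    std::vector<AssetInfo> ToLoad;
    for (const AssetInfo& Asset : Assets) {
        for (const Validator& V : Validators) {
            if (V.MayNeedAsset(Asset)) {
                ToLoad.push_back(Asset);
                break;  // One interested validator is enough to justify the load.
            }
        }
    }
    return ToLoad;
}
```

Only the returned subset would then go through the usual load-and-validate path. A validator whose metadata check returns true for every asset defeats the filter, which is why those would need to move to a separate job.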

Can you please advise on the following?

  1. Am I correct about the unconditional loading of assets here or do we have something misconfigured to cause this and if so, why is it this way?
  2. What approach do you recommend to reduce the time spent executing the data validation commandlet but still maintain full coverage over all assets we need to validate?
  3. What do you think of my approach to first use the minimal FAssetData to determine if any validators actually need this asset to be loaded before loading the asset?
  4. I know there are options to skip validation of unloaded assets, but that means assets don’t get validated when we need them to be.

I have attached my UTRACE for reference but it’s probably not that useful.

Looking forward to your reply.

Cheers!

[Attachment Removed]

Hi,

Q: Am I correct about the unconditional loading of assets here or do we have something misconfigured to cause this and if so, why is it this way?

  • Yes, you are correct. It’s probably like that because some validators will run on all assets and validation happens on loaded assets.

Q: What approach do you recommend to reduce the time spent executing the data validation commandlet but still maintain full coverage over all assets we need to validate?

  • The load time was ‘optimized’ in CL 52278046. You can probably cherry-pick the changes and try it.

Q: What do you think of my approach to first use the minimal FAssetData to determine if any validators actually need this asset to be loaded before loading the asset?

  • It has potential. You could add a new API to the validator that validates with the FAssetData first and returns a status, something like “Invalid, Valid, UnknownNeedToLoad”. You then go through all the validators, and if any of them reports ‘UnknownNeedToLoad’ you fall back to the usual workflow. So you could do that, but as soon as one validator in the list needs a loaded asset for every asset, the optimization breaks. So, like you said, you need to move those validators into another job. I suspect the extra management is the reason why it’s not implemented.
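
A rough sketch of how that status could be combined across validators, written as standalone C++ rather than engine code. `EPreLoadResult`, `PreLoadCheck`, `CombinePreLoadResults` and `AssetInfo` (a stand-in for FAssetData) are all hypothetical names; this is one possible shape under those assumptions, not an existing API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical tri-state result of a metadata-only validation pass.
enum class EPreLoadResult { Valid, Invalid, UnknownNeedToLoad };

// Hypothetical stand-in for FAssetData.
struct AssetInfo {
    std::string PackageName;
    std::string ClassName;
};

using PreLoadCheck = std::function<EPreLoadResult(const AssetInfo&)>;

// If any check cannot decide from metadata, report UnknownNeedToLoad so the
// caller falls back to the usual load-and-validate workflow. Otherwise the
// asset is Invalid if any check failed, and Valid if all of them passed.
EPreLoadResult CombinePreLoadResults(const AssetInfo& Asset,
                                     const std::vector<PreLoadCheck>& Checks) {
    bool bAnyInvalid = false;
    for (const PreLoadCheck& Check : Checks) {
        switch (Check(Asset)) {
            case EPreLoadResult::UnknownNeedToLoad:
                return EPreLoadResult::UnknownNeedToLoad;
            case EPreLoadResult::Invalid:
                bAnyInvalid = true;
                break;
            case EPreLoadResult::Valid:
                break;
        }
    }
    return bAnyInvalid ? EPreLoadResult::Invalid : EPreLoadResult::Valid;
}
```

Letting UnknownNeedToLoad take precedence over Invalid is a design choice in this sketch: it keeps the loaded path as the single place where errors are reported for any asset that ends up being loaded.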

Regards,

Patrick.


Thanks for the reply. I read through that CL and it might help a bit, but the CL description mentions testing with 23,000 assets, so I think our 300,000 assets are still likely to take a very long time. I’ll give it a go though.

Agreed that the benefit of what I am proposing hinges entirely on the composition of validators in use when running the DataValidation commandlet. That said, I think the DataValidation framework would benefit from giving end users an easy way to compose validation sets optimised for their use case, e.g. a quick pre-submit check or a full nightly check. Right now it does not offer that configurability and forces everything to be loaded.
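
As a rough illustration of that kind of composition, in standalone C++ with entirely hypothetical names (`EValidationProfile`, `ValidatorEntry`, `bMatchesEveryAsset`, `SelectValidators`): validators that apply to every asset are left out of the quick pre-submit set and run only in the nightly set.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical profiles for the two use cases mentioned above.
enum class EValidationProfile { PreSubmit, Nightly };

// Hypothetical registration record for a validator.
struct ValidatorEntry {
    std::string Name;
    bool bMatchesEveryAsset;  // True if it would force every asset to load.
};

// Nightly runs everything; pre-submit drops validators that match every asset.
std::vector<ValidatorEntry> SelectValidators(const std::vector<ValidatorEntry>& All,
                                             EValidationProfile Profile) {
    std::vector<ValidatorEntry> Selected;
    for (const ValidatorEntry& Entry : All) {
        if (Profile == EValidationProfile::Nightly || !Entry.bMatchesEveryAsset) {
            Selected.push_back(Entry);
        }
    }
    return Selected;
}
```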

The API that you propose is essentially what I was thinking as well. It fits in pretty well with the existing data validation API for CanValidateAsset.
