Split text into words

alexpanter · October 30, 2023, 11:42am

Difficult to imagine nobody trying to help for 7 years xD

Anyways, with my still limited experience with C++ in Unreal, I would do as I normally would with C++:

Think of the most efficient solution possible

So naturally, if we can avoid creating a new string for each word (i.e. limit the number of dynamic memory allocations), we can speed it up quite a bit. Unreal gives us an equivalent of std::string_view, namely TStringView, which we can use to point to individual words inside the string without creating new strings and copying all the characters over. The longer the input string, the larger the performance gain. The algorithm goes roughly like this:

1. Ensure the input string outlives the views (e.g. keep a local copy). FString is not garbage collected, but a string view does not own its characters, so it dangles if the string is destroyed or reallocated while we work on it!
2. Create an array to hold the string views: TArray<TStringView<TCHAR>> words;
3. Iterate through the string and create a new string view each time the delimiting character (e.g. space) is encountered.

Regarding 3., we can create a string view from a pointer to the first character and a length:
TStringView<TCHAR> myView(const TCHAR* data, int32 size). To get that pointer, we can create an iterator from the string, loop through the characters, and take the address of the current one:

FString input;
for (auto it = input.begin(); it != input.end(); ++it)
{
    if (*it == TEXT(' ')) {
        const TCHAR* ptr = &*it; // address of the current character
        TStringView<TCHAR> v(ptr, 5); // arbitrary word length of 5, for illustration only
    }
}
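
To make the whole algorithm concrete, here is a minimal sketch in standard C++, using std::string_view in place of TStringView so it can be compiled outside Unreal. The name SplitWords and its parameters are my own invention for illustration, not an Unreal API; the same loop should carry over to FString and TStringView<TCHAR> with TCHAR in place of char, but I have not tested that variant.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits `text` on runs of `delim` without copying characters.
// The returned views point into the caller's buffer, so that buffer must stay
// alive and unmodified for as long as the views are in use.
std::vector<std::string_view> SplitWords(std::string_view text, char delim = ' ')
{
    std::vector<std::string_view> words;
    std::size_t start = 0; // index where the current word begins

    for (std::size_t i = 0; i <= text.size(); ++i)
    {
        // Treat the end of the input as one final delimiter.
        if (i == text.size() || text[i] == delim)
        {
            // Skip empty words produced by leading, repeated or trailing delimiters.
            if (i > start)
            {
                words.push_back(text.substr(start, i - start));
            }
            start = i + 1;
        }
    }
    return words;
}
```

This is a single pass over the input, and the only allocations are the ones made by the vector as it grows. Calling SplitWords on a std::string that holds "  split   this text " should give three views, each pointing into that original string.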

There are (quite) a few gotchas that we should keep in mind:

we cannot use a string iterator in a range-based for loop
a string view is by definition not null-terminated
we use wchar_t, whose width and encoding depend on the platform (UTF-16 on Windows, usually UTF-32 on Linux and macOS), so it is safer to stick to Unreal's TCHAR, read up on the differences, and check that we do stuff in a portable manner. (super annoying topic, I blame Microsoft)
We should keep in mind that there may be an arbitrary amount of whitespace, at both ends of the input string and in between words.
We should probably read the C++ spec on string views so we can use them optimally
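
Two of these gotchas (no null terminator, stray whitespace at the ends) can be illustrated with a small sketch, again in standard C++ with std::string_view standing in for TStringView. The helpers TrimSpaces and ToOwned are hypothetical names made up for this example.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: drops leading and trailing spaces without copying.
// Only the view's start and length change; the underlying buffer is untouched.
std::string_view TrimSpaces(std::string_view view)
{
    while (!view.empty() && view.front() == ' ') view.remove_prefix(1);
    while (!view.empty() && view.back() == ' ')  view.remove_suffix(1);
    return view;
}

// Hypothetical helper: makes an owning, null-terminated copy of a view.
// Use it only when an API really needs a C string, since this is where
// the allocation we were trying to avoid gets paid.
std::string ToOwned(std::string_view view)
{
    return std::string(view);
}
```

The character right after a view's last element is simply whatever comes next in the original string, so passing view.data() to something that expects a null-terminated string would read past the word. Copying with ToOwned only at that boundary keeps the rest of the code allocation-free.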

Hopefully this will help someone else looking for a solution to this problem.
I might update / extend this post when I have created and tested a stable solution.

Side note: O(n^2) sounds really scary! But for strings shorter than 200 characters (or thereabouts), it isn’t that big of a deal. But if we do this many times per second, we might start to notice.