I’m trying to use StringToBytes to convert a string into a byte array so I can encrypt it, but some Unicode characters overflow and produce a different string when we call BytesToString on the output byte array. It looks like StringToBytes assumes your string fits in one byte per character, but FString can hold UTF-16, so any wide UTF-16 characters seem to break it.
The character I tested with is ™.
This line seems to be the problem:
OutBytes[ NumBytes ] = (int8)(*CharPos - 1);
The function is shoving wchar_t values into an int8, which isn’t big enough to store all the possible values.
Thank you for the sample code. I was able to reproduce the issue and have logged a report for it here: Unreal Engine Issues and Bug Tracker (UE-33889). You can track the report’s status as the issue is reviewed by our development staff.
I’ve done some looking into this issue, and here are a few notes that I’ve come up with.
The sample code here is not allocating enough space to hold the ™. inString.Len() only returns the number of characters in the string. If you store the string with two bytes per character, your array needs to be inString.Len() * sizeof(TCHAR) bytes.
StringToBytes casts the TCHAR characters to an int8 which is why it is losing data when converting ™.
On the BytesToString side of things, only a single byte at a time is used to set the characters of the FString.
This leaves you with the question of, “How should this be fixed?” From reading the documentation, it looks like Unreal strings are stored internally as UCS-2, which means two bytes per character. I’m far from an expert on Unicode, but I believe the difference from UTF-16 here is that FString does not support multiple code units combining into a single character (surrogate pairs). That means that StringToBytes could be updated to simply store two bytes per character, and BytesToString would need to be updated to match. Herein lies the problem of backwards compatibility: anyone who had used StringToBytes prior to this change and saved the result to disk would find that they can no longer use BytesToString.
One potential solution to this problem would be to do something similar to what the FString docs mention about serialization. They state that if the TCHAR is < 0xff, a single byte is stored; otherwise, two bytes are stored. Updating the two functions in question to work this way might resolve everyone’s problem, since BytesToString would still be able to handle existing saved bytes as either one or two bytes per character. This approach would need approval from the Epic devs, though.
StringToBytes and BytesToString do some interesting things as part of their conversion. For instance, when converting to bytes, each character has 1 subtracted from it. On the other end, 1 is added back to each character. That alone makes it seem like you probably don’t want to use those functions if you’re looking for an exact representation of the string in bytes.
Another concern is that these methods don’t define an encoding for the resulting bytes. This matters if you want to pass the bytes to some other library that expects an array of bytes in a specific encoding.
You might find help looking into StringConv.h. In there, you will see two helper classes called FTCHARToUTF8 and FUTF8ToTCHAR. I’m not sure how well these are supported, but they appeared to work when I tested them.
The adding and subtracting 1 is to handle null terminators. It’s normal.
I know what the issue is, and I’ve already fixed it locally. I was just reporting the bug so Epic could look into it on their side and figure out what they wanted to do, because there isn’t a great solution that both makes the functions work the way you’d expect and remains backwards compatible.