Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-16 support for ODBC (for MSSQL) #1041

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

bold84
Copy link

@bold84 bold84 commented Apr 4, 2023

This pull request adds support for UTF-16 encoding in the ODBC module. The module previously only supported UTF-8 encoding, which made it difficult to exchange data with non-UTF-8 MSSQL databases properly.

With this new feature, the ODBC module can now properly exchange data with MSSQL databases that use UTF-16 encoding. This is especially important for users who work with databases that use different encodings or require internationalization support.

The implementation uses std::wstring_convert to convert between UTF-8 and UTF-16 encodings. The toUtf16() and toUtf8() functions have been added to handle the conversion between the two encodings.

There seem to have been issues in the past:

#164
#179
#1111

@bold84 bold84 marked this pull request as ready for review March 17, 2024 18:26
@bold84 bold84 changed the title [WIP] UTF-16 support for ODBC (for MSSQL) UTF-16 support for ODBC (for MSSQL) Mar 17, 2024
Copy link
Member

@vadz vadz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, but I don't know how do I feel about introducing a separate build variant for Unicode support. IMO it would be really better to just use UTF-8 in all builds, instead of requiring a special build mode to handle Unicode.

And it would be really nice to have some description of this option in the docs, if only to explain what does enabling it change.

Finally, even if this is relatively trivial, it looks like the use of SQLTCHAR could avoid some preprocessor checks in the code.

tests/odbc/test-odbc-mssql.cpp Outdated Show resolved Hide resolved
tests/common-tests.h Outdated Show resolved Hide resolved
src/backends/odbc/vector-use-type.cpp Outdated Show resolved Hide resolved
src/backends/odbc/vector-use-type.cpp Outdated Show resolved Hide resolved
src/backends/odbc/vector-use-type.cpp Outdated Show resolved Hide resolved
Co-authored-by: VZ <vz-github@zeitlins.org>
@bold84
Copy link
Author

bold84 commented Mar 19, 2024

Thanks for working on this, but I don't know how do I feel about introducing a separate build variant for Unicode support. IMO it would be really better to just use UTF-8 in all builds, instead of requiring a special build mode to handle Unicode.

And it would be really nice to have some description of this option in the docs, if only to explain what does enabling it change.

Finally, even if this is relatively trivial, it looks like the use of SQLTCHAR could avoid some preprocessor checks in the code.

The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

The changes in this PR were the only way I could figure out to prevent reading jibberish:

Screenshot 2024-03-20 at 01 25 50

Perhaps you have a better suggestion on how to solve this problem. I'd be glad to implement a more elegant solution.

@bold84
Copy link
Author

bold84 commented Mar 19, 2024

maybe connection_parameters can be abused to decide at runtime what type of char / string is stored in the database.

I think it might also be possible to SQLDescribeCol and determine at runtime then whether to convert back and forth. But the performance penalty is unacceptable.

Or an additional type could be introduced:

ODBC Data Type SOCI Data Type (db_type) row::get<T> specializations
SQL_WCHAR, SQL_WVARCHAR db_wstring std::string

The user facing interface would then still only support UTF-8 (note: no std::wstring).

@vadz
Copy link
Member

vadz commented Mar 19, 2024

The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

I see, thanks.

Perhaps you have a better suggestion on how to solve this problem.

We do need to add SQLWCHAR support to be able to work with the existing database using UTF-16, but I think it should be available in addition to UTF-8 support with plain SQLCHAR, with the decision about which one to use being performed at run-time rather than compile-time.

Ideally, this should be automatic, i.e. when exchanging data with SQLWCHAR columns, UTF-16 should be read from/written to them, while the same code should use UTF-8 when working with SQLCHAR columns, but I don't know if it's possible to implement this easily, so perhaps we need a new db_wstring type instead. If we could make this work automatically, it would be really great, however.

The main point is that I'd really, really love to avoid different incompatible builds. The conditional compilation directives in the tests are a good example of how we do not want the code using SOCI to look like. And there are other problems, e.g. we'd need to add wide char builds to the CI too if we do it like this and I'd rather avoid it.

@bold84
Copy link
Author

bold84 commented Mar 19, 2024

The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

I see, thanks.

Perhaps you have a better suggestion on how to solve this problem.

We do need to add SQLWCHAR support to be able to work with the existing database using UTF-16, but I think it should be available in addition to UTF-8 support with plain SQLCHAR, with the decision about which one to use being performed at run-time rather than compile-time.

Ideally, this should be automatic, i.e. when exchanging data with SQLWCHAR columns, UTF-16 should be read from/written to them, while the same code should use UTF-8 when working with SQLCHAR columns, but I don't know if it's possible to implement this easily, so perhaps we need a new db_wstring type instead. If we could make this work automatically, it would be really great, however.

The main point is that I'd really, really love to avoid different incompatible builds. The conditional compilation directives in the tests are a good example of how we do not want the code using SOCI to look like. And there are other problems, e.g. we'd need to add wide char builds to the CI too if we do it like this and I'd rather avoid it.

Okay, I think we're on the same page. Shall there be an implicit conversion to std::string or shall std::wstring be supported in case of SQLWCHAR based columns?

@Krzmbrzl
Copy link
Contributor

Also note that wstring_convert has been deprecated and scheduled for removal from the standard (without replacement). It's a shame, but either way everyone here should be aware of this 👀

@bold84
Copy link
Author

bold84 commented Mar 19, 2024

Also note that wstring_convert has been deprecated and scheduled for removal from the standard (without replacement). It's a shame, but either way everyone here should be aware of this 👀

That could temporarily (until there's a replacement) be solved with icu, libiconv or alike. But I wasn't sure if that's acceptable at this point.
On the other hand, it could be optional.

@vadz
Copy link
Member

vadz commented Mar 19, 2024

Shall there be an implicit conversion to std::string or shall std::wstring be supported in case of SQLWCHAR based columns?

I think it would make sense to support wstring as people using SQLWCHAR in ODBC are most likely to use it too. But ideally I'd like to be able to support string to/from SQLWCHAR mapping too using UTF-8.

That could temporarily (until there's a replacement) be solved with icu, libiconv or alike.

I'd rather not pull in ICU just for this and libiconv is Unix-only. If necessary, I can contribute my own code, written many years ago, converting between UTF-8 and wchar_t (i.e. either UTF-16 or UTF-32). It has some lower level functions as well as

std::string ToUTF8(const std::wstring& wstr);
std::wstring FromUTF8(const std::string& uft8);

@bold84
Copy link
Author

bold84 commented Mar 19, 2024

That sounds good!

Okay, I'll add std::wstring support in a separate PR first.

Afterwards we can add the conversion to std::string.

@bold84
Copy link
Author

bold84 commented Mar 22, 2024

We might want to move the discussion to here:

#1133

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants