UTF-16 support for ODBC (for MSSQL) #1041

bold84 · 2023-04-04T18:30:42Z

This pull request adds support for UTF-16 encoding in the ODBC module. The module previously only supported UTF-8 encoding, which made it difficult to exchange data with non-UTF-8 MSSQL databases properly.

With this new feature, the ODBC module can now properly exchange data with MSSQL databases that use UTF-16 encoding. This is especially important for users who work with databases that use different encodings or require internationalization support.

The implementation uses std::wstring_convert to convert between UTF-8 and UTF-16 encodings. The toUtf16() and toUtf8() functions have been added to handle the conversion between the two encodings.
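As a rough illustration of the conversion approach described above, here is a minimal sketch. The function names follow the description, but the signatures, the use of `std::u16string`, and the choice of `std::codecvt_utf8_utf16` are assumptions and may not match the code in this PR. Note that `std::wstring_convert` is deprecated since C++17, so compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helpers: the names come from the PR description,
// the signatures are assumed for illustration only.

// Convert UTF-8 bytes to UTF-16 code units.
// from_bytes() throws std::range_error on malformed input.
inline std::u16string toUtf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Convert UTF-16 code units back to UTF-8 bytes.
inline std::string toUtf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Code points outside the Basic Multilingual Plane occupy two UTF-16 code units (a surrogate pair), so buffer sizes should be computed from the converted string rather than from the UTF-8 length.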

There seem to have been issues in the past:

#164
#179
#1111


vadz

Thanks for working on this, but I don't know how I feel about introducing a separate build variant for Unicode support. IMO it would be much better to just use UTF-8 in all builds instead of requiring a special build mode to handle Unicode.

And it would be really nice to have some description of this option in the docs, if only to explain what enabling it changes.

Finally, even if this is relatively trivial, it looks like using SQLTCHAR could avoid some preprocessor checks in the code.

tests/odbc/test-odbc-mssql.cpp

tests/common-tests.h

src/backends/odbc/vector-use-type.cpp

Co-authored-by: VZ <vz-github@zeitlins.org>

bold84 · 2024-03-19T18:28:34Z

> Thanks for working on this, but I don't know how I feel about introducing a separate build variant for Unicode support. IMO it would be much better to just use UTF-8 in all builds instead of requiring a special build mode to handle Unicode.
>
> And it would be really nice to have some description of this option in the docs, if only to explain what enabling it changes.
>
> Finally, even if this is relatively trivial, it looks like using SQLTCHAR could avoid some preprocessor checks in the code.

The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

The changes in this PR were the only way I could figure out to avoid reading gibberish.

Perhaps you have a better suggestion on how to solve this problem. I'd be glad to implement a more elegant solution.


bold84 · 2024-03-19T20:18:40Z

Maybe connection_parameters can be abused to decide at runtime what type of char / string is stored in the database.

I think it might also be possible to call SQLDescribeCol and then determine at runtime whether to convert back and forth. But the performance penalty is unacceptable.

Or an additional type could be introduced:

| ODBC Data Type | SOCI Data Type (`db_type`) | `row::get<T>` specializations |
|---|---|---|
| SQL_WCHAR, SQL_WVARCHAR | db_wstring | std::string |

The user facing interface would then still only support UTF-8 (note: no std::wstring).
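To make the proposed mapping more concrete, here is a minimal sketch of how a column's ODBC type code could be classified once, at describe time. Everything in it is illustrative: `db_type_sketch` and `classify_odbc_text_type` are hypothetical names, not SOCI API, and the numeric constants are redefined locally (mirroring the values in the ODBC headers) only so that the sketch compiles without an ODBC SDK.

```cpp
// Values of the ODBC type codes as defined in the ODBC headers,
// redefined here so the sketch is self-contained.
constexpr int sketch_sql_char     = 1;    // SQL_CHAR
constexpr int sketch_sql_varchar  = 12;   // SQL_VARCHAR
constexpr int sketch_sql_wchar    = -8;   // SQL_WCHAR
constexpr int sketch_sql_wvarchar = -9;   // SQL_WVARCHAR

// Hypothetical stand-in for SOCI's db_type, reduced to the cases
// relevant to this discussion.
enum class db_type_sketch { db_string, db_wstring, db_other };

// Map an ODBC type code to the proposed data type.
// Wide character columns would be exchanged as UTF-16 and
// converted to/from UTF-8 at the interface boundary.
inline db_type_sketch classify_odbc_text_type(int sqlType)
{
    switch (sqlType)
    {
    case sketch_sql_wchar:
    case sketch_sql_wvarchar:
        return db_type_sketch::db_wstring;
    case sketch_sql_char:
    case sketch_sql_varchar:
        return db_type_sketch::db_string;
    default:
        return db_type_sketch::db_other;
    }
}
```

If the classification were computed once per column and stored with the statement, the per-row cost would be a single comparison rather than a repeated describe call, although whether that fits the backend's structure is an open question.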

vadz · 2024-03-19T20:42:08Z

> The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

I see, thanks.

> Perhaps you have a better suggestion on how to solve this problem.

We do need to add SQLWCHAR support to be able to work with the existing database using UTF-16, but I think it should be available in addition to UTF-8 support with plain SQLCHAR, with the decision about which one to use being performed at run-time rather than compile-time.

Ideally, this should be automatic, i.e. when exchanging data with SQLWCHAR columns, UTF-16 should be read from/written to them, while the same code should use UTF-8 when working with SQLCHAR columns, but I don't know if it's possible to implement this easily, so perhaps we need a new db_wstring type instead. If we could make this work automatically, it would be really great, however.

The main point is that I'd really, really love to avoid different incompatible builds. The conditional compilation directives in the tests are a good example of what we do not want the code using SOCI to look like. And there are other problems, e.g. we'd need to add wide char builds to the CI too if we do it like this, and I'd rather avoid that.

bold84 · 2024-03-19T20:49:18Z

> > The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.
>
> I see, thanks.
>
> > Perhaps you have a better suggestion on how to solve this problem.
>
> We do need to add SQLWCHAR support to be able to work with the existing database using UTF-16, but I think it should be available in addition to UTF-8 support with plain SQLCHAR, with the decision about which one to use being performed at run-time rather than compile-time.
>
> Ideally, this should be automatic, i.e. when exchanging data with SQLWCHAR columns, UTF-16 should be read from/written to them, while the same code should use UTF-8 when working with SQLCHAR columns, but I don't know if it's possible to implement this easily, so perhaps we need a new db_wstring type instead. If we could make this work automatically, it would be really great, however.
>
> The main point is that I'd really, really love to avoid different incompatible builds. The conditional compilation directives in the tests are a good example of what we do not want the code using SOCI to look like. And there are other problems, e.g. we'd need to add wide char builds to the CI too if we do it like this, and I'd rather avoid that.

Okay, I think we're on the same page. Shall there be an implicit conversion to std::string or shall std::wstring be supported in case of SQLWCHAR based columns?

Krzmbrzl · 2024-03-19T20:57:28Z

Also note that wstring_convert has been deprecated and scheduled for removal from the standard (without replacement). It's a shame, but either way everyone here should be aware of this 👀

bold84 · 2024-03-19T21:04:50Z

> Also note that wstring_convert has been deprecated and scheduled for removal from the standard (without replacement). It's a shame, but either way everyone here should be aware of this 👀

That could temporarily (until there's a replacement) be solved with ICU, libiconv or the like. But I wasn't sure if that's acceptable at this point.
On the other hand, it could be optional.

vadz · 2024-03-19T23:50:21Z

> Shall there be an implicit conversion to std::string or shall std::wstring be supported in case of SQLWCHAR based columns?

I think it would make sense to support wstring as people using SQLWCHAR in ODBC are most likely to use it too. But ideally I'd like to be able to support string to/from SQLWCHAR mapping too using UTF-8.

> That could temporarily (until there's a replacement) be solved with ICU, libiconv or the like.

I'd rather not pull in ICU just for this and libiconv is Unix-only. If necessary, I can contribute my own code, written many years ago, converting between UTF-8 and wchar_t (i.e. either UTF-16 or UTF-32). It has some lower level functions as well as:

```cpp
std::string ToUTF8(const std::wstring& wstr);
std::wstring FromUTF8(const std::string& utf8);
```
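For readers wondering what such functions involve, here is a minimal standalone sketch with the same two signatures. It is not the code offered above, which is not shown in this thread; the helper `appendUtf8` is a hypothetical name, and the sketch deliberately skips stricter validation such as rejecting overlong encodings or lone surrogates. It handles both 16-bit `wchar_t` (UTF-16, as on Windows) and 32-bit `wchar_t` (UTF-32).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: append one code point to a UTF-8 string.
inline void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline std::string ToUTF8(const std::wstring& wstr)
{
    std::string out;
    for (std::size_t i = 0; i < wstr.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(wstr[i]);
        // With 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF) {
            if (i + 1 >= wstr.size())
                throw std::invalid_argument("truncated surrogate pair");
            const std::uint32_t lo = static_cast<std::uint32_t>(wstr[++i]);
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::invalid_argument("invalid low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        }
        appendUtf8(out, cp);
    }
    return out;
}

inline std::wstring FromUTF8(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and the initial bits.
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        // With 16-bit wchar_t, split code points above the BMP into a surrogate pair.
        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

A hand-written converter like this avoids both the deprecated `std::wstring_convert` and an external dependency, at the cost of having to maintain and test the validation rules in the project itself.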

bold84 · 2024-03-19T23:54:36Z

That sounds good!

Okay, I'll add std::wstring support in a separate PR first.

Afterwards we can add the conversion to std::string.

bold84 · 2024-03-22T12:49:02Z

We might want to move the discussion to here:

#1133

Added UTF-16 support to session, statement and standard-into-type sou…

832a796

…rces

bold84 force-pushed the odbc_unicode_support branch from e9e296e to 832a796 Compare March 17, 2024 13:16

bold84 added 4 commits March 17, 2024 20:39

Merge remote-tracking branch 'origin/master' into odbc_unicode_support

f52e669

stack-use-after-scope

d0694cc

removed #ifdef SOCI_ODBC_WIDE from test-odbc-mssql.cpp

5c300dd

Fixed unit test

fd58e16

bold84 marked this pull request as ready for review March 17, 2024 18:26

bold84 changed the title ~~[WIP] UTF-16 support for ODBC (for MSSQL)~~ UTF-16 support for ODBC (for MSSQL) Mar 17, 2024

cleanup

f862821

bold84 mentioned this pull request Mar 19, 2024

SOCI doesn't compile when using the UNICODE flag on Windows #1111

Open

vadz reviewed Mar 19, 2024


Update src/backends/odbc/vector-use-type.cpp

5d69724

Co-authored-by: VZ <vz-github@zeitlins.org>

bold84 added 6 commits March 20, 2024 02:13

implemented requested changes and changed CMAKE option name

d281bb8

Added documentation

0dd1066

Added documentation

29f5928

Merge branch 'odbc_unicode_support' of https://github.com/ORDIS-Co-Lt…

f3f1c0a

…d/soci into odbc_unicode_support # Conflicts: # docs/installation.md

Merge branch 'master' into odbc_unicode_support

c922daf

removed u8

a0d0686
