Compact 2013 Ebook

13.11 Unicode Strings
Created by djones on 7/20/2013 12:58:03 PM

Coauthored with Thierry Joubert

The code examples in the following chapters of this section use string types and functions such as WCHAR, TCHAR, _T(...) and _tprintf(...). These differ from the usual C-style char and printf(...). The difference comes from character encoding: to support international alphabets, all Windows operating systems since NT 3.1 in 1993, including Windows CE and Windows Embedded Compact, internally use a Unicode character representation rather than the older ASCII representation. As a consequence, in the native C WIN32 API all strings are defined as arrays of 16-bit values terminated by the value 0x0000 (just as C ASCII strings end with a final 0x00). It is the developer's responsibility to provide a correct memory layout for these Unicode strings; because they are C-style arrays, the compiler performs no automatic translation between the two representations.

Warning: A cast such as (short*)my_char_string or (char*)my_short_string compiles and links, but it performs no translation: the receiver will interpret a different string from the one you encoded.

Unicode character representation would not be a problem if C developers had not spent more than 20 years with ASCII strings deeply embedded in their language. A large quantity of legacy code therefore uses ASCII strings, and when the WIN32 API was introduced Microsoft took care to provide both a Unicode and an ASCII version of each function that has at least one string argument. Most desktop developers never notice this, because the choice is driven by a wizard-defined pre-processor macro named UNICODE (the C runtime headers use the companion macro _UNICODE), and they pass ASCII strings to Windows with total impunity. There is still a price to pay, because the ASCII version of a WIN32 function allocates Unicode strings on the heap to do the translation before the call, and cleans up on return.

Table 1.3: Examples of WIN32 API functions

| Generic name | ASCII (#undef UNICODE), Windows desktop ONLY!! | Unicode (#define UNICODE) |
|---|---|---|
| CreateFile | CreateFileA | CreateFileW |
| CreateThread | CreateThread | CreateThread |
| CreateMutex | CreateMutexA | CreateMutexW |
| GetConsoleMode | GetConsoleMode | GetConsoleMode |
| GetConsoleTitle | GetConsoleTitleA | GetConsoleTitleW |

Warning: The WIN32 API for Windows Embedded Compact is strictly Unicode; it does not contain the XxxxA functions. Native code may therefore not migrate directly from a Windows desktop project to a Windows Embedded Compact project.

To incorporate Unicode character representation into the C syntax and runtime, both Microsoft and ANSI introduced extended types and libraries. Everybody agreed on the term "wide" to qualify the newcomers, so Unicode types and functions have a 'w' prefix, as in the ANSI wchar_t or wprintf, or a 'W' prefix, as in the Windows WCHAR or WSTR.

Note: One exception to the 'w' convention is the compiler rule for constant Unicode strings, where an 'L' prefix is required before the string literal ( L"Hello World" ).

Another initiative, taken by Microsoft in its tchar.h and windows.h headers, abstracts the Unicode vs ASCII differences to allow source code portability. This abstraction uses "generic types" selected by the UNICODE pre-processor macro. The generic types and functions have a '_t' or 'T' prefix, as in _tprintf or TCHAR. One result of these combined efforts is some redundancy at the syntactic level.

Note: In Windows, wchar_t and WCHAR are the same 16-bit wide character data type. The generic type TCHAR is also a wide character when UNICODE is defined.

The next table shows some corresponding generic, ASCII and Unicode types and functions.

Table 1.4: Generic types equivalence

| Generic type, function or macro name | ASCII (#undef UNICODE) | Unicode (#define UNICODE) |
|---|---|---|
| TCHAR | char | wchar_t |
| _T( ) or TEXT( ) | Does nothing | Adds 'L' prefix |
| _tmain( ) | main( ) | wmain( ) |
| _tprintf( ) | printf( ) | wprintf( ) |
| _tcslen( ) | strlen( ) | wcslen( ) |
| _tcscat( ) | strcat( ) | wcscat( ) |
| _ttoi( ) | atoi( ) | _wtoi( ) |
| _getts( ) | gets( ) | _getws( ) |
| _putts( ) | puts( ) | _putws( ) |
| (string) TCHAR* or TCHAR[ ] | char* or char[ ] | wchar_t* or wchar_t[ ] |

Note1: Windows Embedded Compact projects define both _UNICODE and UNICODE by default.

Note2: A catalog component provides "String Safe Utility Functions". Functions ending with _s should be preferred, as they guard against buffer overrun.

The ASCII (or C string) "Hello World!" has the Unicode equivalent L"Hello World!". The _T or TEXT macro resolves this by delivering an ASCII string literal or a Unicode string literal depending on whether the UNICODE macro is defined.

Last but not least, in an effort towards syntax abstraction, the Microsoft WIN32 API redefines all C base types and provides its own types in windows.h. These "Windows" types are uppercase and use a field composition technique, as in LPVOID, which reads as Long Pointer[1] to void (and resolves to void*).

This syntax translation, combined with the ASCII vs. Unicode option, has created a lot of confusion since the early days of the WIN32 API. If you apply the field composition technique of Windows types and everything you just learned about Unicode types, plus the fact that STR denotes what is usually called sz, a "zero (terminated) string", decoding the type names becomes fairly straightforward.

Table 1.5: Windows string types

| Type | Description | String type | Base type |
|---|---|---|---|
| LPSTR | Long pointer to zero string | ASCII | char * |
| LPWSTR | Long pointer to wide zero string | Unicode | WCHAR * |
| LPCSTR | Long pointer to constant zero string | ASCII | const char * |
| LPCWSTR | Long pointer to constant wide zero string | Unicode | const WCHAR * |
| LPTSTR | Long pointer to generic zero string | Either | TCHAR * |
| LPCTSTR | Long pointer to constant generic zero string | Either | const TCHAR * |

Note: Zero strings have an implicit length. The WIN32 API also offers other string types, such as BSTR (a pointer to OLECHAR), which carries an explicit length. Never copy a BSTR to an STR directly; you must perform a translation.

Table 1.6: Windows string code samples

| Type | Type Definition | Example |
|---|---|---|
| LPSTR | typedef char* LPSTR; | char name[ ] = "Compact2013"; |
| LPWSTR | typedef WCHAR* LPWSTR; | wchar_t name[ ] = L"Compact2013"; |
| LPCSTR | typedef const char* LPCSTR; | const char name[ ] = "Compact2013"; |
| LPCWSTR | typedef const WCHAR* LPCWSTR; | const wchar_t name[ ] = L"Compact2013"; |
| LPTSTR | typedef TCHAR* LPTSTR; | TCHAR name[ ] = _T("Compact2013"); |
| LPCTSTR | typedef const TCHAR* LPCTSTR; | const TCHAR name[ ] = _T("Compact2013"); |

Time for some assessment.

Question: I'm working on a native Compact 2013 console project and I need string input from the user. How should I declare my string?

Answer: You may declare your string as ASCII, Unicode or Generic; the choice depends only on how you will use it inside your program.

// ASCII version
char    strA[MAXSTR];
scanf("%s", strA);
printf("you typed: %s", strA);
// Unicode version
wchar_t strW[MAXSTR];
wscanf(L"%s", strW);
wprintf(L"you typed: %s", strW);
// Generic version
TCHAR   strG[MAXSTR];
_tscanf(_T("%s"), strG);
_tprintf(_T("you typed: %s"), strG);

Listing 1.3: Various versions of string code usage. ASCII, Unicode and Generic

All these implementations are valid in a native Compact 2013 project. Remember that Unicode strings are only required at the WIN32 API level: as long as your strings are never passed as parameters to a WIN32 function, there is no reason to force Unicode or to use the Generic types. The choice affects memory consumption, because a wide string uses twice the memory of the equivalent ASCII string, which may be a concern in an embedded application. The choice may also depend on text serialization/deserialization when the format is imposed by specifications.

Note: In the three samples provided, the macros, variable types and function types are always consistent. Avoid cross-typing in your code, as it will compile or work only in some restricted cases.


[1] Long here stands for bigger than 16-bit; remember there was (primitive) life before the WIN32 API.

