Can't (correctly) read UTF-8 file in Visual Studio C++

  • Thread starter: rjmarshall17 (Guest)

Hi,

I'll start off with the standard disclaimer: I'm not a C++ programmer. I've been trying to read files on Windows that were written on Linux. The files contain filenames made up of random Unicode characters (I'm doing a test); I created those names from Windows over a Samba share.

I'm using a short test program because I eventually have to modify a larger program that uses a Windows DLL to collect information on the files and put it into a specific format. The current version of the larger program doesn't handle Unicode filenames, so isolating the opening and reading of the file until I figure out how to do it correctly seemed like a good approach. (The real input will be a comma-separated file with additional information besides the filename.) I won't say how long I've been battling this... because it's embarrassing.

Eventually, once I know I can read the UTF-8 files correctly, I need to convert the UTF-8 filenames into UTF-16LE for the DLL to consume, but for now I can't even get this to work.
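
For that later UTF-8 to UTF-16LE step, here is a minimal sketch of the conversion in portable C++, with no Windows calls. The name `utf8_to_utf16` is hypothetical (it is not part of the test program or the DLL). The sketch throws on bytes it cannot decode but does not reject overlong encodings. On Windows the same job is normally done with `MultiByteToWideChar` and `CP_UTF8`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// char16_t units are stored in native byte order, which is little-endian
// (that is, UTF-16LE) on x86/x64 Windows.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // continuation bytes expected after the lead byte

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Above the BMP: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, `utf8_to_utf16("\xC3\xAA")` yields the single code unit U+00EA (ê), the first accented character in the test filename.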

The file I'm trying to read as a test is:

PS Z:\> Get-Content -Path Z:\u8_test03 -Encoding UTF8
Z:unicode_test01\level3\êFbén6m0SW9ewGìDÆvi5sê

As you can see, the file is correctly encoded in UTF-8 and Get-Content displays it just fine. However, my test program prints this:

PS Z:\> C:\Users\Administrator\RJMTestApp01.exe --readfile Z:\u8_test03
The read file name is: Z:\u8_test03

Read line is: Z:unicode_test01\level3\ΩFbΘn6m0SW9ewG∞D╞vi5sΩ

Found token #0 at position 0: Z:unicode_test01\level3\ΩFbΘn6m0SW9ewG∞D╞vi5sΩ
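
One possible reading of this output, offered as a guess rather than a confirmed diagnosis: every wrong glyph is what code page 437 (a common default for English-language Windows consoles) shows for the low byte of the expected character. ê is U+00EA, and byte 0xEA is Ω in CP437; likewise é (0xE9) gives Θ, ì (0xEC) gives ∞, and Æ (0xC6) gives ╞. That pattern would suggest the file is being decoded correctly and the characters are narrowed to single bytes on the way to the console. The sketch below checks the correspondence; `cp437_glyph` and `shown_in_cp437` are hypothetical helpers, and the table covers only these four bytes.

```cpp
#include <string>

// Hypothetical helper: the UTF-8 spelling of the glyph that code page 437
// assigns to a byte. Only the four bytes seen in the output are covered.
std::string cp437_glyph(unsigned char byte)
{
    switch (byte) {
    case 0xC6: return "\xE2\x95\x9E"; // U+255E, box-drawing "╞"
    case 0xE9: return "\xCE\x98";     // U+0398, Greek capital theta "Θ"
    case 0xEA: return "\xCE\xA9";     // U+03A9, Greek capital omega "Ω"
    case 0xEC: return "\xE2\x88\x9E"; // U+221E, infinity "∞"
    default:   return "?";            // not in this partial table
    }
}

// Narrow a code point to its low byte, then look up what CP437 would show.
std::string shown_in_cp437(char32_t code_point)
{
    return cp437_glyph(static_cast<unsigned char>(code_point & 0xFF));
}
```

Calling `shown_in_cp437(0x00EA)` returns the UTF-8 bytes for Ω, which matches the first substituted character in the output above.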

My test program source is below. Any help would be appreciated.

Thanks,

Rob

// RJMTestApp01.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <locale>
#include <string>
#include <atlstr.h>

TCHAR g_readFileName[MAX_PATH];

#define MAX_READ_STRING (4096 + 1)

int _tmain(int argc, TCHAR *argv[])
{
    int i = 1;
    wchar_t readLine[MAX_READ_STRING];
    CString Separator = _T(",");
    CString Token;
    int position = 0;
    int previous_position = 0;
    int token_count = 0;

    if (argc < 2) {
        _tprintf(_T("A read file is required\n"));
        return 1;
    }

    while (i < argc) {
        if (_tcsicmp(argv[i], _T("--readfile")) == 0 ||
            _tcsicmp(argv[i], _T("-readfile")) == 0) {
            // The option needs a value, and the value must fit in the buffer.
            if (i + 1 >= argc || _tcslen(argv[i + 1]) >= MAX_PATH) {
                _tprintf(_T("A read file is required\n"));
                return 1;
            }
            _tcscpy_s(g_readFileName, MAX_PATH, argv[i + 1]);
        }
        else {
            _tprintf(_T("Invalid option: %s\n"), argv[i]);
            return 1;
        }

        i += 2;
    }

    // Make sure the file exists
    // FILE *fReadFile = _tfopen(g_readFileName, _T("r, ccs=UTF-8"));

    // strlen and mbstowcs_s take char strings, so this part only compiles
    // in a multi-byte (non-Unicode) build, where TCHAR is char.
    size_t newSize = strlen(g_readFileName) + 1;
    size_t convertedChars = 0;
    wchar_t *w_readFileName = new wchar_t[newSize];
    mbstowcs_s(&convertedChars, w_readFileName, newSize, g_readFileName, _TRUNCATE);
    FILE *fReadFile = _wfopen(w_readFileName, L"r, ccs=UTF-8");
    if (fReadFile == NULL) {
        wprintf(L"Invalid file, not found: %ls\n", w_readFileName);
        delete[] w_readFileName;
        return 1;
    }
    wprintf(L"The read file name is: %ls\n\n", w_readFileName);

    // Loop on the read itself rather than on feof(), which only becomes
    // true after a read has already failed.
    while (fgetws(readLine, MAX_READ_STRING, fReadFile) != NULL) {
        position = 0;
        previous_position = 0;
        token_count = 0;
        wprintf(L"Read line is: %ls\n", readLine);

        CString tmp(readLine);

        Token = tmp.Tokenize(Separator, position);
        while (!Token.IsEmpty()) {
            _tprintf(_T("Found token #%d at position %d: %s\n"), token_count, previous_position, Token.GetString());
            previous_position = position;
            token_count++;
            Token = tmp.Tokenize(Separator, position);
        }
    }

    fclose(fReadFile);
    delete[] w_readFileName;
    return 0;
}
