digitalmars.D.learn - How to know whether a file's encoding is ansi or utf8?

"Sam Hu" <samhudotsamhu gmail.com> writes:
Greetings!

As the subject says, how can I tell whether a file is encoded in UTF-8 or 
ANSI?

Thanks for the help in advance.

Regards,
Sam
Jul 22 2014
"Sam Hu" <samhudotsamhu gmail.com> writes:
On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> As the subject says, how can I tell whether a file is encoded in
> UTF-8 or ANSI?
Sorry, I mean in code. For example, when I read a file's contents and print them to a text control in a GUI, or to the console, I need to proceed differently depending on the file's encoding.
Jul 22 2014
"FreeSlave" <freeslave93 gmail.com> writes:
On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> As the subject says, how can I tell whether a file is encoded in
> UTF-8 or ANSI?
By ANSI, do you mean Windows code pages? Text editors usually use heuristics (statistical analysis, for example) to determine a file's encoding. These methods are not always accurate, so you should give users a way to choose a different encoding.
Jul 22 2014
"Sam Hu" <samhudotsamhu gmail.com> writes:
On Tuesday, 22 July 2014 at 11:09:36 UTC, FreeSlave wrote:
> By ANSI, do you mean Windows code pages? Text editors usually use
> heuristics (statistical analysis, for example) to determine a file's
> encoding. These methods are not always accurate, so you should give
> users a way to choose a different encoding.
Thanks. Yes, it is Windows-related again... I found that writefln() prints the contents of ANSI-encoded files to the console correctly in an Asian font environment, but this does not work for UTF-8 files. On the other hand, the Tango for D2 branch prints UTF-8 files to the console correctly in the same environment. I tried to handle both cases with Tango but failed. So I have a silly idea: when I need to print a file to the console, I choose writefln or Tango's Stdout.formatln depending on the file's encoding.
Jul 22 2014
"Alexandre" <alebencz gmail.com> writes:
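Read the BOM?

module main;

import std.stdio;

enum Encoding
{
	UTF7,
	UTF8,
	UTF32,
	Unicode,          // UTF-16LE
	BigEndianUnicode, // UTF-16BE
	ASCII             // no BOM found; the real encoding is unknown
}

Encoding GetFileEncoding(string fileName)
{
	import std.file : read;

	// Read at most 4 bytes; the file may be shorter, so check lengths below.
	auto bom = cast(ubyte[]) read(fileName, 4);

	if (bom.length >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76)
		return Encoding.UTF7;
	if (bom.length >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
		return Encoding.UTF8;
	// Test UTF-32 before UTF-16: the UTF-32LE BOM starts with the UTF-16LE BOM.
	if (bom.length >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0)
		return Encoding.UTF32; // UTF-32LE
	if (bom.length >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
		return Encoding.UTF32; // UTF-32BE
	if (bom.length >= 2 && bom[0] == 0xff && bom[1] == 0xfe)
		return Encoding.Unicode; // UTF-16LE
	if (bom.length >= 2 && bom[0] == 0xfe && bom[1] == 0xff)
		return Encoding.BigEndianUnicode; // UTF-16BE

	return Encoding.ASCII;
}

void main(string[] args)
{
	if (GetFileEncoding("test.txt") == Encoding.UTF8)
		writeln("The file is UTF8");
	else
		writeln("File is not UTF8 :(");
}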



On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> As the subject says, how can I tell whether a file is encoded in
> UTF-8 or ANSI?
Jul 22 2014
"Sam Hu" <samhudotsamhu gmail.com> writes:
On Tuesday, 22 July 2014 at 11:59:34 UTC, Alexandre wrote:
> Read the BOM?
> [code snipped]
Thanks. This is exactly what I want at this moment.
Jul 22 2014
"FreeSlave" <freeslave93 gmail.com> writes:
Note that BOMs are optional and may not be present in a Unicode 
file. Also, the presence of leading bytes that look like a BOM does 
not necessarily mean that the file is encoded in some form of Unicode.
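
Both caveats can be shown with a few bytes. This is only a minimal sketch: the helper name hasUtf8Bom and the sample byte arrays are made up for illustration and are not part of the code posted earlier in the thread.

```d
/// Returns true if `data` begins with the UTF-8 BOM bytes EF BB BF.
/// (Hypothetical helper, written for this example only.)
bool hasUtf8Bom(const(ubyte)[] data)
{
    return data.length >= 3
        && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}

unittest
{
    // "héllo" saved as UTF-8 without a BOM: the BOM check finds nothing,
    // although the content is Unicode.
    const(ubyte)[] utf8NoBom = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F];
    assert(!hasUtf8Bom(utf8NoBom));

    // The Windows-1252 text "ï»¿é" is the bytes EF BB BF E9: it starts
    // with BOM-like bytes, yet the lone 0xE9 that follows is not valid UTF-8.
    const(ubyte)[] cp1252 = [0xEF, 0xBB, 0xBF, 0xE9];
    assert(hasUtf8Bom(cp1252));
}
```

Compiling with `-unittest` runs both checks. A BOM test alone therefore gives a hint, not a guarantee, in either direction.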
Jul 22 2014
"Alexandre" <alebencz gmail.com> writes:
http://www.architectshack.com/TextFileEncodingDetector.ashx

On Tuesday, 22 July 2014 at 15:53:23 UTC, FreeSlave wrote:
> Note that BOMs are optional and may not be present in a Unicode
> file. Also, the presence of leading bytes that look like a BOM does
> not necessarily mean that the file is encoded in some form of Unicode.
There are several difficulties in this case ...
Jul 22 2014
"Kagamin" <spam here.lot> writes:
I first try to load the file as UTF-8 (or just the first 8 KB or so 
of it) with encoding exceptions turned on. If I catch an exception, 
I reload it as ANSI; otherwise I assume it's valid UTF-8.
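
A minimal sketch of this approach in D, assuming std.utf.validate plays the role of "encoding exceptions turned on". The names isValidUtf8, trimCutSequence and looksLikeUtf8, and the 8192-byte default, are assumptions made for illustration; this is not necessarily how the code described above is written.

```d
import std.file : read;
import std.utf : UTFException, validate;

/// Returns true if `data` is well-formed UTF-8 (pure ASCII also qualifies).
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Drops a trailing multi-byte sequence that a size-limited read may have
/// cut in half, so that the cut itself is not reported as an error.
const(ubyte)[] trimCutSequence(const(ubyte)[] data)
{
    size_t n = data.length;
    size_t dropped = 0;
    // Step back over up to three continuation bytes (bit pattern 10xxxxxx).
    while (n > 0 && dropped < 3 && (data[n - 1] & 0xC0) == 0x80)
    {
        --n;
        ++dropped;
    }
    // Drop the lead byte (11xxxxxx) of that last sequence as well.
    if (n > 0 && data[n - 1] >= 0xC0)
        --n;
    return data[0 .. n];
}

/// Hypothetical helper: samples the start of the file and reports whether
/// it decodes as UTF-8. On false, the caller would reload the file as ANSI.
bool looksLikeUtf8(string fileName, size_t sampleSize = 8192)
{
    auto data = cast(const(ubyte)[]) read(fileName, sampleSize);
    // Only a sample that filled the buffer can have been cut mid-sequence.
    if (data.length == sampleSize)
        data = trimCutSequence(data);
    return isValidUtf8(data);
}

unittest
{
    // "héllo" in UTF-8 decodes cleanly.
    const(ubyte)[] utf8 = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F];
    assert(isValidUtf8(utf8));

    // "héllo" in Windows-1252: the lone 0xE9 byte is rejected.
    const(ubyte)[] ansi = [0x68, 0xE9, 0x6C, 0x6C, 0x6F];
    assert(!isValidUtf8(ansi));

    // A sample cut after the first byte of "é" is trimmed back to "h".
    const(ubyte)[] cut = [0x68, 0xC3];
    assert(trimCutSequence(cut).length == 1);
}
```

Trimming the tail matters only when sampling: an 8 KB cut can land in the middle of a multi-byte character and make valid UTF-8 look broken. The check can still be fooled by ANSI text that happens to form valid UTF-8 sequences, so a manual override remains useful.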
Jul 24 2014