EDN Admin
Well-known member
Hi, I want to determine whether a string is UTF-8 encoded or not. I have tried to determine this using the following rules (from Wikipedia):
<ol style="widows:2; text-transform:none; background-color:#ffffff; text-indent:0px; margin:0.3em 0px 0px 3.2em; letter-spacing:normal; font:13px/19px sans-serif; white-space:normal; orphans:2; color:#000000; word-spacing:0px; padding:0px">
<li style="margin-bottom:0.1em">Every valid ASCII character is also a valid UTF‑8 encoded Unicode character with the same binary value. (Thus, valid ASCII text is also valid UTF‑8-encoded Unicode text.)</li>
<li style="margin-bottom:0.1em">For every UTF‑8 byte sequence corresponding to a single Unicode character, the first byte unambiguously indicates the length of the sequence in bytes.</li>
<li style="margin-bottom:0.1em">All continuation bytes (byte nos. 2–6 in the table above) have <code style="background-color:#f9f9f9; font-family:monospace,Courier New">10</code> as their two most-significant bits (bits 7–6); in contrast, the first byte never has <code style="background-color:#f9f9f9; font-family:monospace,Courier New">10</code> as its two most-significant bits. As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF‑8 stream represents the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a byte sequence.</li>
<li style="margin-bottom:0.1em">As a consequence of no. 3 above, starting with any arbitrary byte anywhere in a (valid) UTF‑8 stream, it is necessary to back up by only at most five bytes in order to get to the beginning of the byte sequence corresponding to a single character (three bytes in actual UTF‑8 as explained in the next section). If it is not possible to back up, or a byte is missing because of e.g. a communication failure, one single character can be discarded, and the next character be correctly read.</li>
<li style="margin-bottom:0.1em">Starting with the second row in the table above (two bytes), every additional byte extends the maximum number of bits by five (six additional bits from the additional continuation byte, minus one bit lost in the first byte).</li>
<li style="margin-bottom:0.1em">Prosser and Thompson's scheme was sufficiently general to be extended beyond 6-byte sequences (however, this would have allowed FE or FF bytes to occur in valid UTF-8 text, see under Advantages in section "<a href="http://en.wikipedia.org/wiki/UTF-8#Compared_to_single-byte_encodings">Compared to single-byte encodings</a>" below, and indefinite extension would lose the desirable feature that the length of a sequence can be determined from the start byte only).</li>
</ol>
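For illustration, the rules above can be turned into a byte-level validity check. The following is only a minimal sketch in Python (the language and the function name `is_valid_utf8` are my own choices, not from the original post). It assumes the modern 4-byte limit of UTF-8 rather than the older 6-byte scheme the quoted list mentions, and it also rejects overlong forms and surrogates, which the quoted rules do not cover.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence (4-byte maximum)."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Rule 2: the first byte alone determines the sequence length.
        if lead < 0x80:                # 0xxxxxxx: plain ASCII (rule 1)
            length = 1
        elif 0xC2 <= lead <= 0xDF:     # 110xxxxx (0xC0/0xC1 would be overlong)
            length = 2
        elif 0xE0 <= lead <= 0xEF:     # 1110xxxx
            length = 3
        elif 0xF0 <= lead <= 0xF4:     # 11110xxx, capped at U+10FFFF
            length = 4
        else:
            # A continuation byte (10xxxxxx) in lead position, or an invalid lead.
            return False

        if i + length > n:
            return False               # sequence is truncated

        # Rule 3: every continuation byte has 10 as its two top bits.
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                return False

        # Extra restrictions on the second byte: no overlong forms,
        # no surrogates, nothing above U+10FFFF.
        if length >= 3:
            second = data[i + 1]
            if lead == 0xE0 and second < 0xA0:
                return False
            if lead == 0xED and second > 0x9F:
                return False
            if lead == 0xF0 and second < 0x90:
                return False
            if lead == 0xF4 and second > 0x8F:
                return False

        i += length
    return True
```

A check like this only says that the bytes could be UTF-8. It cannot prove that they were not written in some single-byte code page whose bytes happen to form a valid UTF-8 sequence.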
<p style="margin-bottom:0.1em"><span style="font-family:Arial">This works fine for normal files. In the scenario below, however, the check reports the file as UTF-8, but the file is actually ANSI-encoded.<br/>
The file contains two characters, ╝┬ (the first character's ANSI code is 194 and the second one's is 188). How should I handle this?</span></p>
<p style="margin-bottom:0.1em"><span style="font-family:Arial">Thanks</span></p>
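To make the problem concrete: the two bytes 194 and 188 (0xC2 0xBC) are a well-formed UTF-8 sequence and are also legal in single-byte code pages, so no validity check alone can tell the cases apart. The sketch below is in Python and only shows a common heuristic, not a definitive answer. The function name `guess_encoding` and the `cp1252` fallback are assumptions of mine; the post does not say which ANSI code page is in use.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Heuristically guess an encoding name for `data`.

    `fallback` is an assumed ANSI code page; replace it with the one
    actually used on the system that produced the file.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"        # an explicit BOM is the strongest hint
    if all(b < 0x80 for b in data):
        return "ascii"            # pure ASCII reads the same either way
    try:
        data.decode("utf-8")      # strict decode: raises on malformed input
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


sample = bytes([194, 188])
print(guess_encoding(sample))       # the heuristic picks UTF-8 here
print(sample.decode("utf-8"))       # one character, U+00BC
print(sample.decode("cp1252"))      # two characters under Windows-1252
print(sample.decode("cp437"))       # two box-drawing characters under CP437
```

For input this short the guess is unreliable either way. If the files are known to come from a particular source, relying on a BOM, on metadata, or on an explicit setting is safer than detection.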