Whatever Goes: October 2014

Sunday, October 19, 2014

Java: charset intro: new String(byte[] bytes, String charsetName)

Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding (1 to 4 bytes per code point); the latter is a fixed-length encoding that always uses a single byte.

Latin-1 encodes just the first 256 code points of the Unicode character set (U+0000 to U+00FF), whereas UTF-8 can encode every Unicode code point.
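To make the difference in coverage concrete, here is a minimal sketch using only the standard library. The class name EncodingLengthDemo and its two helper methods are made up for illustration; the sample characters are arbitrary picks from different Unicode ranges.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class EncodingLengthDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if every character in the string has a Latin-1 encoding,
    // i.e. lies within the first 256 code points.
    static boolean fitsInLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // A (U+0041), e-acute (U+00E9), euro sign (U+20AC), emoji (U+1F600)
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X utf8Bytes=%d fitsInLatin1=%b%n",
                    s.codePointAt(0), utf8Length(s), fitsInLatin1(s));
        }
    }
}
```

The UTF-8 length grows from 1 to 4 bytes across the four samples, while only the first two fit in Latin-1 at all. When a string with unmappable characters is passed to getBytes with Latin-1, Java substitutes a replacement byte rather than failing, so data is lost silently.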

At the physical encoding level, only code points 0 - 127 (the ASCII range) are encoded identically. Code points 128 - 255 differ:

- with UTF-8, each becomes a 2-byte sequence

- with Latin-1, each remains a single byte.
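
The byte-level difference, and what happens when new String(byte[] bytes, String charsetName) is given the wrong charset name, can be sketched as follows. The class Latin1VsUtf8 and its helpers encodeHex and recode are hypothetical names chosen for this example. The checked UnsupportedEncodingException is wrapped in an unchecked exception purely to keep the sketch short.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, not part of any library.
class Latin1VsUtf8 {
    // Encode text with the named charset and return the bytes as hex, e.g. "C3 A9".
    static String encodeHex(String text, String charsetName) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : text.getBytes(charsetName)) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            return sb.toString().trim();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Encode with one charset, then decode with another via new String(byte[], String).
    static String recode(String text, String encodeAs, String decodeAs) {
        try {
            byte[] bytes = text.getBytes(encodeAs);
            return new String(bytes, decodeAs);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // code point 233, inside the 128 - 255 range

        System.out.println(encodeHex("A", "ISO-8859-1"));    // 41
        System.out.println(encodeHex("A", "UTF-8"));         // 41
        System.out.println(encodeHex(eAcute, "ISO-8859-1")); // E9
        System.out.println(encodeHex(eAcute, "UTF-8"));      // C3 A9

        // Matching charsets round-trip cleanly.
        System.out.println(recode(eAcute, "UTF-8", "UTF-8").equals(eAcute)); // true

        // UTF-8 bytes decoded as Latin-1: two characters (U+00C3, U+00A9), not one.
        System.out.println(recode(eAcute, "UTF-8", "ISO-8859-1").length()); // 2

        // Latin-1 byte E9 decoded as UTF-8 is malformed and becomes U+FFFD.
        System.out.println(recode(eAcute, "ISO-8859-1", "UTF-8").equals("\uFFFD")); // true
    }
}
```

The decoder does not throw in either mismatch case. It returns a different string, which is why the charset name passed to the constructor has to match the encoding that produced the bytes.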