• balsoft@lemmy.ml · 24 hours ago

        I understand Unicode and its various encodings (UTF-8, UTF-16, UTF-32) fairly well. UTF-8 is backwards-compatible with ASCII and only takes extra bytes if you use characters outside the 0x00-0x7F range. E.g. this comment I’m writing is simultaneously valid UTF-8 and valid ASCII.
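
        A minimal Python sketch of that property (the sample string is just illustrative):

            text = "Pure ASCII text like this sentence."
            utf8_bytes = text.encode("utf-8")
            print(utf8_bytes == text.encode("ascii"))  # True: byte-for-byte identical to the ASCII encoding
            print(len(utf8_bytes) == len(text))        # True: one byte per character, no overhead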

        I’d like to see some good evidence for the claim that Unicode support increases memory usage so drastically, especially given that most of the data in RAM is typically something other than encoded text (e.g. videos, photos, the internal state of software).

        • Frezik@lemmy.blahaj.zone · 24 hours ago

          It’s not so much the character length of any specific encoding. It’s all the details that go into supporting it. Can’t assume text is read left to right. Can’t assume case insensitivity works the same way it does in your language. Can’t assume the shape of a glyph won’t be affected by the glyph next to it. Can’t assume the shape of a glyph won’t be affected by a glyph five positions down.
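
          A minimal Python sketch of the case-mapping point (the examples are mine, just to make it concrete):

              print("Straße".upper())                         # 'STRASSE': uppercasing can change string length
              print("MASSE".casefold() == "Maße".casefold())  # True: caseless matching needs full case folding
              print(len("İ"), len("İ".lower()))               # 1 2: lowercasing U+0130 yields two code points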

          Pile up millions of these little assumptions you can no longer make in order to support every written language ever. It gets complicated.

          • The_Decryptor@aussie.zone · 16 hours ago

            Yeah, but that’s still not a lot of data; LTR/RTL shouldn’t vary within a given script, so the values can be shared across an entire range of characters.
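
            For what it’s worth, the per-character bidirectional class is queryable from Python’s unicodedata module (a rough sketch; the characters are just illustrative):

                import unicodedata

                for ch in "a", "\u05d0", "1", "\u0663":  # Latin a, Hebrew alef, digit 1, Arabic-Indic 3
                    print(hex(ord(ch)), unicodedata.bidirectional(ch))
                # 0x61 L, 0x5d0 R, 0x31 EN, 0x663 AN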

    • taaz@biglemmowski.win · 1 day ago

      I don’t know the details, but my general IT knowledge says that a single Unicode character/glyph can take up to 4 bytes instead of 1 (ASCII).
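
      A quick Python illustration of that variable width (the sample characters are my own picks):

          for ch in "A", "é", "€", "😀":
              print(ch, len(ch.encode("utf-8")), "bytes")
          # A 1 bytes, é 2 bytes, € 3 bytes, 😀 4 bytes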

      • balsoft@lemmy.ml · 24 hours ago

        Yes, but that shouldn’t generally explode RAM usage by an order of magnitude. In terms of the amount of information, most of the data that computers handle is an internal binary representation of real-world phenomena (think videos, pictures, audio, sensor data), not encoded text.

      • Marek Knápek@programming.dev · 23 hours ago

        No! One code point can be encoded by up to 4 UTF-8 code units, not one glyph. Glyphs do not map to code points one to one: one glyph can be encoded by more than one code point, and each code point can be encoded by more than one code unit. Code points are a Unicode thing, code units are a Unicode-encoding thing, glyphs are a font+Unicode thing.

        For example, the glyph á might be a single code point or two code points. A single code point because it is a common letter in some languages and was used in computers before Unicode was invented; two code points because it might be the base letter a followed by a combining diacritic mark. Not all diacritic letters have a single-code-point variant.

        Also emoji: they can be a single glyph made of multiple code points, for example the skin tone modifiers for the various face emoji, or male+female characters combined into a single glyph forming a family glyph. Country flags are also a single glyph but multiple code points.

        Unicode is BIG, there is A LOT of stuff in it. For example, sorting based on the user’s language, or conversion to upper/lower case, is also not trivial (google the Turkish i).
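
        A minimal Python sketch of the code point vs. glyph distinction described above (the specific characters are just illustrative examples):

            import unicodedata

            precomposed = "\u00e1"      # 'á' as one code point (U+00E1)
            decomposed  = "a\u0301"     # 'a' + combining acute accent (U+0301)
            print(len(precomposed), len(decomposed))    # 1 2: same glyph, different code point counts
            print(precomposed == decomposed)            # False: naive comparison misses the equivalence
            print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True after normalization

            family = "\U0001f468\u200d\U0001f469\u200d\U0001f467"  # man + ZWJ + woman + ZWJ + girl
            print(len(family))                  # 5 code points rendered as one family glyph
            print(len(family.encode("utf-8")))  # 18 bytes (code units) in UTF-8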