• taaz@biglemmowski.win · 4 · 1 day ago

    I don’t know the details, but my general IT knowledge says that a single Unicode character/glyph can take up to 4 bytes in UTF-8 instead of 1 byte in ASCII.
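
    A quick Python sketch of what I mean, assuming UTF-8 is the encoding in play (the characters chosen here are just illustrative examples):

    ```python
    # UTF-8 byte length per character: 1 byte for plain ASCII, up to 4 for others
    for ch in ["a", "é", "€", "🙂"]:
        print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
    # a -> 1, é -> 2, € -> 3, 🙂 -> 4
    ```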

    • balsoft@lemmy.ml · 13 · 24 hours ago

      Yes, but that shouldn’t generally explode the RAM usage by an order of magnitude. In terms of information amount, most of the data that computers handle is an internal binary representation of real-world phenomena (think: videos, pictures, audio, sensor data) and not encoded text.

    • Marek Knápek@programming.dev · 6 · 23 hours ago

      No! One code point can be encoded by up to 4 UTF-8 code units (bytes), not one glyph. Glyphs do not map to code points one to one: one glyph can be encoded by more than one code point, and each code point can be encoded by more than one code unit. Code points are a Unicode thing, code units are an encoding (UTF-8/UTF-16) thing, and glyphs are a font+Unicode thing.

      For example, the glyph á might be a single code point or two code points: a single code point because it is a common letter in some languages and was used in computers before Unicode was invented, or two code points because it might be the base letter a followed by a combining diacritic mark. Not all diacritic letters have a single-code-point variant. The same goes for emoji: they are a single glyph but multiple code points, for example the skin tone modifiers for the various face emoji, or male and female characters combined into a single family glyph. Country flags are also a single glyph but multiple code points.

      Unicode is BIG, there is A LOT of stuff in it. For example, sorting based on the user’s language and conversion to upper/lower case are also not trivial (google the Turkish i).
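
      A quick Python sketch of the code point vs. code unit distinction (counting glyphs, i.e. grapheme clusters, would need a third-party library, so I only show code points and UTF-8 bytes here; the specific characters are just examples I picked):

      ```python
      import unicodedata

      # "á" as a single precomposed code point (U+00E1, NFC form)
      nfc = unicodedata.normalize("NFC", "a\u0301")
      # "á" as base letter + combining acute accent (U+0061 U+0301, NFD form)
      nfd = unicodedata.normalize("NFD", "\u00e1")
      print(len(nfc), len(nfc.encode("utf-8")))  # 1 code point, 2 UTF-8 bytes
      print(len(nfd), len(nfd.encode("utf-8")))  # 2 code points, 3 UTF-8 bytes

      # A country flag is one glyph but two "regional indicator" code points
      flag = "\U0001F1E8\U0001F1FF"  # 🇨🇿
      print(len(flag), len(flag.encode("utf-8")))  # 2 code points, 8 UTF-8 bytes

      # A family emoji: person emojis joined by zero-width joiners into one glyph
      family = "\U0001F468\u200D\U0001F469\u200D\U0001F466"  # 👨‍👩‍👦
      print(len(family), len(family.encode("utf-8")))  # 5 code points, 18 UTF-8 bytes

      # Turkish i: str.lower() is locale-independent, so "I" becomes "i",
      # not the dotless "ı" a Turkish user would expect
      print("I".lower())
      ```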