The PHP podcast where everyone chimes in.

Originally aired on May 30th, 2016

046: Character Encoding and UTF-8 in PHP

If you've ever found weird-looking characters in your database or on your website and didn't know why, then this episode is for you. Those bizarre characters, called "mojibake", rear their ugly heads when we don't use a consistent character encoding. Today we discuss what character encoding is, how to account for it in HTML, PHP & your database, and how we can ensure we'll never encounter an unexpected alien character in our web apps again.


Character Encoding and UTF-8 in PHP Show Summary


What are character encodings?

  • Comparison of number systems (base-10, binary, hexadecimal)
  • Historical development of character maps and encodings (telegram, ASCII, ISO-8859-1...)
  • Unicode and its implementations (UTF-8, UTF-16, UTF-32)
    • UTF-8 uses 1-4 bytes per character; it can represent every Unicode character while keeping files small for mostly-ASCII text
    • UTF-16 uses 2 or 4 bytes per character, so it results in larger files for mostly-ASCII text
    • UTF-32 uses 4 bytes per character, so results in even larger files
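
To make the size comparison above concrete, here is a minimal sketch that counts the bytes each encoding needs for the same character. It is written in Python purely for illustration; the helper name `encoded_sizes` and the sample characters are assumptions, not something from the episode.

```python
# Compare how many bytes one piece of text needs in each Unicode encoding.
# SAMPLES is an illustrative choice: ASCII, Latin-1 range, BMP, and emoji.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codec variants do not prepend a byte order mark,
        # so only the character data itself is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in SAMPLES:
        print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

For plain ASCII the three encodings need 1, 2, and 4 bytes respectively, while an emoji needs 4 bytes in all of them, which is why UTF-8 tends to win on size for mostly-Latin text.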

Character encodings in PHP

Character encodings in HTML

Character encodings in MySQL

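  • At the time of recording, MySQL still defaulted to the latin1 character set with the latin1_swedish_ci collation, so beware (MySQL 8.0 and later default to utf8mb4)
  • MySQL's utf8 character set stores at most 3 bytes per character; only utf8mb4 supports the full range of UTF-8 characters (as people have discovered from trying to store emoji)
  • varbinary and blob consume less space than utf8 varchar, so they are useful in fields that users won't touch or whose contents never need to include non-ASCII characters (e.g. URLs)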
    • Is there any other tradeoff?
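
As a rough illustration of the utf8 versus utf8mb4 point, the sketch below flags text that would not fit in a 3-byte-per-character column. It is Python for illustration only, and `needs_utf8mb4` is a hypothetical helper name; the underlying assumption is that a character needs 4 bytes in UTF-8 exactly when its code point is above U+FFFF.

```python
def needs_utf8mb4(text):
    """Return True if any character in `text` takes 4 bytes in UTF-8.

    Characters above U+FFFF (outside the Basic Multilingual Plane),
    such as most emoji, are the ones that need 4 bytes.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    # Accented Latin letters fit in 2 bytes, so this prints False.
    print(needs_utf8mb4("caf\u00e9"))
    # U+1F600 (grinning face) needs 4 bytes, so this prints True.
    print(needs_utf8mb4("hello \U0001F600"))
```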

Tips/best practices

  • Save source files as UTF-8 without the Byte Order Mark (BOM).
  • Always declare the encoding in both a <meta> tag and a Content-Type header.
    • Remember that header() must precede any echoed output. (Presence of a BOM can cause bugs here.)
  • Storing UTF-8 special characters in their native form (rather than as escaped sequences) in your source can act as a "canary" for others: if I see mojibake, maybe my IDE/editor is not configured correctly?
  • For MySQL, run SET NAMES utf8 at the beginning of every connection (SET NAMES utf8mb4 if you need the full Unicode range, such as emoji)
  • Always validate inputs; always consider character encoding for input (whether from user or APIs), persistence, and output.
  • Remember that connections/clients themselves also have encodings
  • Mismatched encoding bugs can hide among Roman alphanumerics, since ISO-8859 and UTF-8 encode the ASCII range (code points 0-127) identically. Check higher code points to be sure.
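
A few of the tips above can be demonstrated at the byte level. The following is a minimal sketch in Python, for illustration only; `misread_as_latin1`, `repair_mojibake`, and `strip_utf8_bom` are hypothetical helper names. It shows how reading UTF-8 bytes as ISO-8859-1 produces mojibake, why ASCII-only text hides the mismatch, and how a leading BOM can be detected and removed.

```python
import codecs


def misread_as_latin1(text):
    """Encode `text` as UTF-8, then wrongly decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled):
    """Undo misread_as_latin1 by reversing the wrong decoding step."""
    return garbled.encode("latin-1").decode("utf-8")


def strip_utf8_bom(data):
    """Remove a leading UTF-8 byte order mark (EF BB BF) from raw bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    garbled = misread_as_latin1("caf\u00e9")
    # The two UTF-8 bytes of U+00E9 (C3 A9) show up as two Latin-1 characters.
    print(garbled)
    print(repair_mojibake(garbled) == "caf\u00e9")
    # ASCII-only text survives the mismatch, which is how the bug hides.
    print(misread_as_latin1("cafe") == "cafe")
    # A BOM ahead of the first real byte counts as output-like content.
    print(strip_utf8_bom(codecs.BOM_UTF8 + b"<?php"))
```

The repair step only works when the wrong decoding was lossless, as it is with ISO-8859-1; it is meant to show the mechanism, not to serve as a general fix for already-corrupted data.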

Other resources

Andreas Heigl



Developer Shout-Out

The Developer Shout-Out recognizes developers in the community for their contributions.

For this episode, the panel guests, Andreas and Evert, nominated Michael Cullum for the Developer Shout-Out segment.

Thank you, Michael Cullum, for your excellent cat herding skills and work on @phpfig 3.0. A $50 Amazon gift card is on its way to you.

$50 Amazon gift card sponsored by Laracasts

Laracasts

It's like Netflix for developers.

Show Notes Credit

Thank you Dominic Bordelon for authoring the show notes for this episode!

If you'd like to contribute show notes and totally get credit for it, check out the show-notes repo!