Way back in October I noticed this WHATWG HTML bug (26942) where someone asked, "why do these examples of <html> lack the lang attribute?"
I thought the answer from Hixie was a bit dismissive and not based on any data or real-world benefits of use, particularly in the context of screen readers:
Why not? Realistically, few people include it. It just means the language is unknown.
At the time, I could not get the latest archive to download from WebDevData.org (though that has changed, see below), so I fell back to asking for help on why the lang attribute is valuable.
How the lang Attribute on <html> Is Used
I got lots of good bits of feedback, which I collected into a Storify. I've distilled all that great information to these key points:
- VoiceOver on iOS uses the attribute to auto-switch voices.
- VoiceOver can speak a particular language using a different accent when specified.
- Leaving out the lang attribute may require the user to manually switch to the correct language for proper pronunciation.
- JAWS uses it to load the correct phonetic engine / phonologic dictionary, which is handy for sites with multiple languages.
- NVDA (Windows) uses it in the same way as VoiceOver and JAWS.
- When used in HTML that is used to form an ePub or Apple iBooks document, it affects how VoiceOver will read the book.
- Firefox, IE10, and Safari (as of a year ago) only support CSS hyphens: auto when the lang attribute is set (not from Twitter; source).
In the absence of a lang attribute on the <html> element, screen readers will fall back to the user's default system setting (barring any custom overrides) when speaking content.
How Many Pages Use lang
On January 8, WebDevData.org (from a W3C Community Group) posted its latest archive (which did not error on download, woo!). It consists of the HTML from 87,000 web pages.
I pulled down the 780MB file and re-taught myself the skills necessary to parse the files. For those who are regular expression geniuses, you are welcome to suggest an alternate approach, but I used the following pattern to return all the <html> elements: <html([^>]+)>. It fails for any <html> with no attributes at all, but for what I am doing that's ok.
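As a rough illustration of that pattern in use, here is a minimal Python sketch. Only the <html([^>]+)> expression comes from the post; the function names, the case-insensitive flag, and the second pattern that looks for a plain lang attribute are assumptions made for the example, not the approach actually used for the numbers below.

```python
import re

# The pattern from the post: captures the attribute text of an <html> tag.
# It does not match a bare <html> with no attributes.
# Assumption: matching is case-insensitive, so <HTML LANG="en"> also counts.
HTML_TAG = re.compile(r"<html([^>]+)>", re.IGNORECASE)

# Assumed helper pattern (not from the post): a lang attribute that is not
# the tail end of xml:lang, data-lang, hreflang, and similar names.
LANG_ATTR = re.compile(r"(?<![\w:-])lang\s*=", re.IGNORECASE)


def html_tag_attributes(markup):
    """Return the attribute text of the first <html> tag, or None."""
    match = HTML_TAG.search(markup)
    return match.group(1) if match else None


def has_lang(markup):
    """True if the first <html> tag carries a plain lang attribute."""
    attrs = html_tag_attributes(markup)
    return attrs is not None and LANG_ATTR.search(attrs) is not None
```

Calling has_lang('<html lang="en">') returns True, while has_lang('<html xml:lang="en">') and has_lang('<html>') both return False. A regular expression is a blunt tool here; a real HTML parser would cope better with odd quoting or attribute values that happen to contain "lang=".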
Of the 84,054 pages I parsed (I excluded XML, ISO files, and so on), I found that 39,433 use the lang attribute on the <html> element. That's just about 47% (46.914% if I understand significant digits correctly).
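The percentage is easy to double-check. A small Python sketch using the two counts above (the function and variable names are just for illustration):

```python
def percent_share(count, total):
    """Return count as a percentage of total."""
    return count / total * 100


pages_parsed = 84_054      # pages left after excluding XML, ISO files, etc.
pages_with_lang = 39_433   # pages whose <html> element has a lang attribute

print(f"{percent_share(pages_with_lang, pages_parsed):.3f}%")  # prints 46.914%
```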
What that tells me is that, far from few people including it, nearly half of the pages in this sample do.
There are 12,672 instances of xml:lang, though at a quick scan they appear alongside lang. If anyone with better regex skills would like to help me further parse, please let me know.
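One possible way to test that quick-scan impression is to sort each <html> tag's attribute text into four buckets: lang only, xml:lang only, both, or neither. The Python sketch below assumes the attribute strings have already been extracted; the patterns, names, and sample data are all hypothetical and were not run against the archive.

```python
import re
from collections import Counter

# Assumed patterns (not from the post). The lookbehind stops "lang" from
# matching inside xml:lang, data-lang, hreflang, and so on.
LANG = re.compile(r"(?<![\w:-])lang\s*=", re.IGNORECASE)
XML_LANG = re.compile(r"(?<![\w:-])xml:lang\s*=", re.IGNORECASE)


def classify(attrs):
    """Label one <html> attribute string by which language attributes it has."""
    has_lang = bool(LANG.search(attrs))
    has_xml_lang = bool(XML_LANG.search(attrs))
    if has_lang and has_xml_lang:
        return "both"
    if has_lang:
        return "lang only"
    if has_xml_lang:
        return "xml:lang only"
    return "neither"


def tally(attribute_strings):
    """Count how many attribute strings fall into each bucket."""
    return Counter(classify(attrs) for attrs in attribute_strings)


# Made-up sample input, one string per <html> tag.
sample = [
    ' lang="en" xml:lang="en"',
    ' xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"',
    ' class="no-js" lang="de"',
    ' class="no-js"',
]
print(tally(sample))
```

If the "xml:lang only" bucket came back near zero for the real data, that would support the impression that xml:lang almost always travels with lang.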
Why You Should Use the lang Attribute on the <html> Element
Hyphens
By using lang, you get the benefits of hyphen support in your (modern) browser that you otherwise would not get (assuming you use hyphens: auto in your CSS).
Accessibility
At the very least, lang is a benefit for screen reader users, particularly when your users don't have the same primary language as your site. It allows proper pronunciation and inflection when the page is spoken.
WCAG Compliance
Including the lang attribute satisfies a Level A requirement of the Web Content Accessibility Guidelines 2.0 (specifically Success Criterion 3.1.1, Language of Page). Technique H57 identifies the lang attribute specifically.
Internationalization
The W3C Internationalization (I18n) Activity has a great Q&A on why you should use lang, which was updated less than two months ago. I'll reprint the start of the answer, but there is far more detail and I strongly recommend you go read it.
Identifying the language of your content allows you to automatically do a number of things, from changing the look and behavior of a page, to extracting information, to changing the way that an application works. Some language applications work at the level of the document as a whole, some work on appropriately labeled document fragments.
We list here a few of the ways that language information is useful at the moment, however, as specifications and browsers evolve in the future there could be numerous additional applications for language information.
Interesting Aside
If you go to the WHATWG HTML5 specification today and view the page source, you'll see the following language declaration in the code:
<html class=split data-revision="$Revision: 8877 $" lang=en-GB-x-hixie>
Not to be outdone, the W3C HTML5 spec has the same language declaration.
If anybody has the en-GB-x-hixie phonologic dictionary in his or her screen reader, I'd love to hear it.
While technically allowed (the -x puts it in the private-use subtag category), it's bad form:
Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement amongst parties.
Because these subtags are only meaningful within private agreements and cannot be used interoperably across the Web, they should be used with great care, and avoided whenever possible.
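To make the private-use part concrete, here is a small Python sketch that pulls out whatever follows the x singleton in a language tag. It assumes a well-formed tag and does no validation against the subtag registry; the function name is made up for the example.

```python
def private_use_subtags(tag):
    """Return the subtags that follow the private-use singleton 'x', if any.

    Assumes a well-formed language tag; nothing is validated here.
    """
    subtags = tag.lower().split("-")
    if "x" not in subtags:
        return []
    # Everything after the first 'x' is private use.
    return subtags[subtags.index("x") + 1:]


print(private_use_subtags("en-GB-x-hixie"))  # prints ['hixie']
print(private_use_subtags("en-GB"))          # prints []
```

A screen reader or hyphenation engine has no registry entry to look up for hixie, so in practice only the en-GB portion can mean anything to it.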
Update: January 1, 2015
For what it's worth, I've filed bugs against the W3C HTML5 spec and the WHATWG HTML5 spec.
Update: February 25, 2015
Another case where a lang attribute is important, though in this case on a specific element, is outlined in the piece HTML5 number inputs – Comma and period as decimal marks:
<input type="number"> will open a numeric software keyboard on modern mobile operating systems. Not every user can input decimal numbers into this convenient field without proper localization.
[…]
Half the world uses a comma and the other half uses a period as their decimal mark. (In Latin scripts.) Does your web application take that into consideration? Do the browsers?