Developers and others working with computers like to automate things, and testing is one of the top candidates for automation. As software grows and gets more complex, it takes more time to test. Sometimes new features or improvements break something that worked before. That applies to all software, including web and mobile. For example, when we extend a template to support a new design, we can easily break the old one if we are not careful. The same goes for accessibility: if we do not test the related parts, we can quickly introduce new accessibility problems.
So automatic testing is a logical desire for every aspect of software, and accessibility is no exception. Almost all aspects of software can be tested automatically; even the graphical parts can nowadays be covered by automated tests, so we naturally wonder whether we can also automate testing for accessibility. As we are used to the Web Content Accessibility Guidelines (WCAG) and their success criteria, we immediately look into the possibility of automatically testing all success criteria on levels A and AA, since that is most probably our target for assuring compliance and being on the safe side. So the first and most important question is before us – what is and what is not possible?
Is it possible to automatically test all WCAG success criteria?
No! Let’s get real and honest. It is not possible to automatically test all WCAG success criteria. Some parts of WCAG are not related to code at all and therefore cannot be tested with code. Some parts of WCAG need correct code as a basis, but correct code alone is not enough to test them automatically. And there are also parts of WCAG that really can be tested automatically.
We live in amazing times, and our hardware allows us to do incredible things that were previously possible only in imagination or as mathematical concepts. The rise of machine learning and artificial intelligence has most likely crossed your screen, and maybe even a printed article, and I think we can agree that it really expands our possibilities – also for automatic testing. Pattern recognition, natural language processing, and even more abstract concepts such as emotions are now transforming the digital world and enabling some pretty spectacular software solutions. The amount of data fed to these machine learning systems is rising exponentially, and it really allows them to learn more and more.
But can this evolution also help us automatically test the WCAG success criteria that are more complex? I think we are not there yet. Some companies are working on it and getting quite amazing results. For example, finding missing semantics in the code based on the graphical user interface is definitely a useful concept. If we can detect a button in the graphical rendering of the page but not in the code, we can flag that as a possible WCAG error and let a human verify it. This loop will most probably also improve the statistical side of the algorithms and reinforce their learning, so that after some time we will get much more certainty and our tools will be much better. I am certainly not familiar with all the possibilities of machine learning and artificial intelligence, but I can think of areas that could be improved with help from human specialists and reinforcement learning, if there is enough data. And I can certainly say that there is enough data, with almost 1.2 billion websites in the world (opens in new window). So if I may check the crystal ball and predict the future – more and more WCAG success criteria will at least be flagged as potential errors automatically, but again, we will have to wait some time to get more certain results.
Automatic testing can often only find what is failing, not what is passing
Another problem with automatic testing is that it can often only establish that we did not fail a specific WCAG success criterion, which does not mean that we have actually passed it. Again, it requires human intervention. A basic example is WCAG success criterion 1.1.1, Non-text Content (level A). We can, for example, test that an image has an alt attribute and that the attribute is empty. We then assume the image is decorative. But is it really? Software alone cannot tell us that – maybe it can in cases where the image is wrapped in a link that also has no accessible name, but in other cases it requires a human to check whether it is correct. So we can conclude – automatic testing can capture a missing alt attribute, and that is a definitive fail. But it will take a long time before automatic tests can decide whether the alternative text itself is really appropriate. So this test cannot really tell us whether the criterion has passed.
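As a minimal sketch of this limitation, the snippet below (assuming axe-core is already loaded on the page, for example via a script tag) runs only the image-alt rule. It can report images with no alt attribute at all as definite failures, but a “pass” still leaves the question of whether the alternative text is appropriate to a human reviewer.

```javascript
// Minimal sketch, assuming axe-core is loaded on the page (e.g. via a <script> tag).
axe
  .run(document, { runOnly: { type: 'rule', values: ['image-alt'] } })
  .then((results) => {
    // Definite failures: images with no alt attribute at all.
    results.violations.forEach((violation) => {
      violation.nodes.forEach((node) => console.log('Fail:', node.html));
    });

    // "Passes" only mean an alt attribute exists (possibly alt="" for decorative images).
    // Whether the text is actually appropriate still needs human judgement.
    results.passes.forEach((pass) => {
      pass.nodes.forEach((node) => console.log('Needs human review:', node.html));
    });
  });
```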
Some automatic tests are usable when coding
I am actually referring to linting. When we make the small pieces of a product accessible, we are also making the whole product more accessible, so testing as early as possible is the best practice, and linting helps us a lot. It can catch missing or even misspelled attributes. For example, “aria-labelledby” is often written incorrectly, and a linter can prevent that early. Sometimes developers also use attribute values that are not supported, and some ARIA attributes need additional attributes to work. So these kinds of automatic tests are absolutely worth using.
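As one possible sketch of such linting in a React/JSX codebase, eslint-plugin-jsx-a11y offers rules that cover exactly these cases – invalid ARIA attribute names, unsupported values, and roles missing required attributes. The configuration below is just one example setup, not the only way to do it.

```javascript
// .eslintrc.js – a minimal sketch using eslint-plugin-jsx-a11y (assumes a React/JSX project).
module.exports = {
  plugins: ['jsx-a11y'],
  extends: ['plugin:jsx-a11y/recommended'],
  rules: {
    // Catches misspelled or non-existent ARIA attributes, e.g. "aria-labeledby".
    'jsx-a11y/aria-props': 'error',
    // Catches unsupported values, e.g. aria-hidden="yes" instead of "true"/"false".
    'jsx-a11y/aria-proptypes': 'error',
    // Catches roles that are missing the ARIA attributes they require.
    'jsx-a11y/role-has-required-aria-props': 'error',
    // Catches <img> elements without an alt attribute.
    'jsx-a11y/alt-text': 'error',
  },
};
```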
At the same time, when we are developing small pieces of software, it is much more manageable to test them. Besides linting, we can then use in-browser debugging tools made especially for accessibility, like browser extensions and even the default developer tools. With them we can again try to catch simple, syntax-based errors, and they can even help with additional checks like color contrast, link names and so on. But again – they do produce false positives from time to time, and therefore human knowledge is still required for effective and realistic results.
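For reference, the color contrast checks these tools run are based on the contrast ratio formula defined by WCAG. Here is a small sketch of the calculation for plain, solid sRGB colors – ignoring transparency, gradients and overlapping layers, which is exactly where tools often go wrong:

```javascript
// Contrast ratio as defined by WCAG, for solid sRGB colors only.
// Real pages have transparency, gradients and overlays – the cases where tools often misjudge.
function relativeLuminance([r, g, b]) {
  const [R, G, B] = [r, g, b].map((channel) => {
    const c = channel / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * R + 0.7152 * G + 0.0722 * B;
}

function contrastRatio(foreground, background) {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  const [lighter, darker] = l1 >= l2 ? [l1, l2] : [l2, l1];
  return (lighter + 0.05) / (darker + 0.05);
}

// Example: gray text (#767676) on white just passes the 4.5:1 threshold for normal text.
console.log(contrastRatio([118, 118, 118], [255, 255, 255]).toFixed(2)); // ≈ 4.54
```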
Some automatic tests return false results
I’ve been in those situations quite often – running automatic tests and getting back tens or even hundreds of errors, trusting them by default, checking the code first, and then returning to the results to see if they really found an error. The more complex the website, the more often I got back false positives. Reports of color contrast below the 4.5:1 ratio were wrong quite often – too often to really trust them. And I totally get why: sometimes CSS is applied asynchronously, or JavaScript changes it before the tool is able to capture the final state, for example. There are even bugs in the testing tools themselves that can cause false positives; after all, testing tools are software as well and not immune to bugs. So would I stop my continuous integration pipeline on a failing accessibility test? Well, it depends – some rules are to be trusted, some not.
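One pragmatic way to act on that in a pipeline is to fail the build only for rules you have learned to trust and merely report the rest for manual review. A rough sketch, assuming axe-core results as input – the list of “trusted” rule IDs here is just an example of my own choosing:

```javascript
// Sketch of a CI gate: fail the build only on rules we have learned to trust,
// report everything else for manual review. The "trusted" list is just an example.
const TRUSTED_RULES = new Set(['image-alt', 'label', 'button-name', 'html-has-lang']);

function evaluate(axeResults) {
  const blocking = [];
  const needsReview = [];

  axeResults.violations.forEach((violation) => {
    (TRUSTED_RULES.has(violation.id) ? blocking : needsReview).push(violation);
  });

  needsReview.forEach((v) => console.warn(`Review manually: ${v.id} (${v.nodes.length} nodes)`));
  blocking.forEach((v) => console.error(`Blocking: ${v.id} (${v.nodes.length} nodes)`));

  // A non-zero exit code stops the pipeline only for trusted, high-confidence rules.
  process.exitCode = blocking.length > 0 ? 1 : 0;
}

module.exports = { evaluate };
```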
Automatic tool vendors may produce different results for the same tests
Yes, and this is not unusual. I’ve used different testing tools on the same code and noticed it quite often, which surprised me at first. How can we get different results for the same code? How is it possible that sometimes even the same test rules – Accessibility Conformance Testing (ACT) rules (opens in new window) – produce different results? Well, it turns out that HTML, CSS and JavaScript can be interpreted differently. Browsers are pretty good at fixing imperfect HTML, for example, while automatic tools may not be so good at this. And there is also the question of whether the tool builds the Document Object Model (DOM tree) itself or uses the one built by the browser. Some tools cannot rely on the browser to fix the markup for them and must therefore implement their own parsing. So we can even get differences between how a real browser parses the code and how a browser-less tool does it.
Other differences may also occur due to different interpretations, or more precisely, different impact estimations. I’ve seen some tools over-report and some under-report the same problem. In aXeSiA I tried to run multiple tools at the same time and then combine the results. I noticed that sometimes one tool simply stopped working for certain webpages while another one did not, which caused some problems with the analysis. How can I use multiple results if some tools fail to run on the site at all? Well, it turned out that the pages had issues that some tools could work around and others could not. That made me investigate the tools, and I gained some insight into how they actually build the DOM, the CSSOM and the Accessibility Object Model. That was the problem – they did it differently, and some tools failed to get around problems in the site’s source while others made it through. I think this explains a lot. I still try to use different tools as much as possible for the same reason – some tools are better at some tests and some at others. So, to get a better overview it makes sense to use multiple tools, but then it also takes more time.
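A simplified sketch of what combining results from multiple tools can look like – the runner functions are hypothetical placeholders, not the actual aXeSiA implementation. The key point is that each runner may fail independently, so failures are recorded instead of aborting the whole analysis:

```javascript
// Simplified sketch of combining multiple tools; the runners passed in are hypothetical
// wrappers around real tools, each of which may fail on a given page.
async function combineResults(url, runners) {
  const combined = { url, findings: [], failedTools: [] };

  for (const [toolName, run] of Object.entries(runners)) {
    try {
      const findings = await run(url);
      findings.forEach((finding) => combined.findings.push({ tool: toolName, ...finding }));
    } catch (error) {
      // One tool crashing (e.g. on markup it cannot parse) should not stop the analysis.
      combined.failedTools.push({ tool: toolName, reason: error.message });
    }
  }
  return combined;
}

// Usage sketch (runAxe and runOtherTool are hypothetical):
// combineResults('https://example.com', { axe: runAxe, otherTool: runOtherTool })
//   .then((report) => console.log(report));
```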
Tools can mislead with their scoring
A perfect example of misleading scoring is Google Lighthouse with its accessibility testing. It is very easy to get an accessibility score of 100, but this can give us a totally wrong sense of the real situation. 100 does not mean the site is actually accessible! It only means that we passed some of the automatic tests. And as we remember, automatic tests cannot cover all WCAG success criteria – and even if they could, WCAG alone does not guarantee accessibility; it just defines guidelines that help us get there. So a Lighthouse accessibility score of 100/100 means that the webpage passed all the automatic tests that Google decided to use. As we know, Google Lighthouse uses Deque’s axe behind the scenes (opens in new window), and they decided not to use all the test rules from axe but only a subset of them.
To be fair, there is a text warning next to the score, but unfortunately I do not think it counters the psychological effect of the 100 beside it well enough. I think they should make the situation clearer. People can quickly be misled into thinking that their site is perfectly accessible and that there is nothing left to improve. Maybe the design will change at some point, but a score of 100 is, in my opinion, still totally misleading. If axe can only test about 30% of WCAG, and Google Lighthouse only took a subset of that, how can we get a score of 100? If we take WCAG 2.1 at level AA, we have 50 success criteria, and that should be the goal for compliance (the common standard in multiple pieces of legislation). So if automatic testing can cover only about 15 success criteria (30% of 50), then maybe the Lighthouse score should range from 0 to 30% as well, while informing the user about the rest (70%) that has to be tested and verified manually.
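To illustrate the idea with the numbers above (the 30% coverage figure is an estimate, not an official value), rescaling the Lighthouse score to the share of WCAG 2.1 AA that automatic tests can realistically cover would look roughly like this:

```javascript
// Rough illustration only: rescale a 0–100 Lighthouse accessibility score to the share of
// WCAG 2.1 AA that automatic tests can realistically cover (~30% is an estimate, not official).
const TOTAL_AA_CRITERIA = 50;    // WCAG 2.1, levels A + AA
const AUTOMATABLE_CRITERIA = 15; // ~30% of 50, per the estimate above
const MAX_AUTOMATED_SHARE = AUTOMATABLE_CRITERIA / TOTAL_AA_CRITERIA; // 0.3

function rescaledScore(lighthouseScore) {
  // A "perfect" 100 would then read as 30, with the remaining 70% marked for manual testing.
  return Math.round(lighthouseScore * MAX_AUTOMATED_SHARE);
}

console.log(rescaledScore(100)); // 30 – the rest still has to be verified manually
```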
Tools are not perfect but they do help
To conclude – I am not saying we should drop the tools. I am just saying that we need to be informed about them, their downsides and their possibilities. At the same time, we must be aware that they can produce false positives, and that to make sense of the scores they produce we need to understand the reasoning, the methodology and the differences between tools and their interpretations.
I will still use linting when I develop, check components with browser tools, and run automatic tools in continuous integration pipelines and as crawlers. Although not perfect, they help, and everything that helps is an advantage.