Mar 28, 2019 ● Shea Brown
Automation, Risk and Robust Artificial Intelligence

Professor Thomas Dietterich on the need for high reliability in socio-technical AI systems

The ways in which artificial intelligence (AI) is woven into our everyday lives can hardly be overstated. Powerful deep machine-learning algorithms increasingly predict which movies we want to watch, which ads we’ll respond to, whether we are eligible for a loan, and how likely we are to commit a crime or perform well on the job.¹ AI is also revolutionizing the automation of physical systems such as factories, power plants, and self-driving cars, and the pace of deployment is rapidly increasing. However, the recent fatal failures of automated driving and flight-control systems built by Tesla, Uber, and Boeing highlight the risks of relying on opaque and highly complex software in dangerous situations.² Mitigating the dangers posed by such systems is an area of active research known as resilience engineering, but the rapid adoption of AI, coupled with its notorious lack of algorithmic transparency, makes it difficult for that research to keep pace.

One of the pioneers of machine learning, Professor Thomas Dietterich, has spent the last several years investigating how artificial intelligence can be made more robust, especially when it is embedded in a complex socio-technical system where the combination of human and software errors can propagate into untested regimes. Paul Scharre’s recent book, “Army of None”, inspired Prof. Dietterich to look deeper into the literature on high-reliability organizations for potential solutions. “It drove home to me the sense that it’s not enough to make the technology reliable; you need to make the entire human organization that surrounds it reliable as well. Understanding what that means has been taking up a lot of my time.” Prof. Dietterich spared some of that time to discuss with me his thoughts on AI ethics, AI safety, and his recent article “Robust artificial intelligence and robust human organizations”.

The ethical AI community is often split into two groups: engineers on one side, focusing on specific notions of fairness or algorithmic transparency, and social scientists on the other, more interested in the ethical implications for society as a whole. Your recent article seems to fall somewhere in between these groups.

There’s been an awakening within the AI community to the importance of social science in general. I really trace it to when social media came onto the scene, and people realized that social scientists could use it to study things about people. However, they needed computer scientists’ help to process the immense amount of data coming out of social media, so we started seeing the rise of computational social science. That was the initial opening.
Then people started thinking about the question of influence and the spread of ideas in social networks, and maximizing influence in social networks for advertising purposes, leading again to more contact with social scientists and economists. Then there was a realization that there could be biases that are causing harm to different sub-communities through the way these systems are working. I just got back from the AIES conference, which held its second meeting, and nearly simultaneously there was the FAT* conference, and I think that huge chunks of the field are waking up to the fact that we need to be thinking about the broader socio-technical systems and not just the technical part. It’s manifesting itself in all of these different facets, and I’m pursuing one particular facet because I’m concerned with these high-risk, high-stakes decisions.
I’m also involved right now as part of the team that’s putting together this 20-year roadmap for AI research that is being coordinated by the CCC (the Computing Community Consortium). We’ve had three workshops, and people are saying that we can do X, Y, and Z, like getting computers to track people in meetings or in buildings and to understand their emotions. Twenty years ago people would have said: “that’s so cool”. Now everyone is saying, “but that could be abused so badly”, so we’ve really been sensitized to all the downsides of these things. Now we realize that so many of the technological improvements that we want to see in the next 20 years can have horrible consequences as well as really good ones, so it’s quite sobering.

Do you feel that regulations are needed to avert these consequences, or can companies effectively self-regulate?

I’m not an expert on law vs. regulation vs. industry practices, but it seems to me that we are seeing poor use of technology and inappropriate organizational [structures] surrounding the technology. Let’s focus on facial recognition for a moment. I’m not sure if you’ve been following the work of Joy Buolamwini at MIT, but I think she’s been doing very interesting work auditing the quality of these [commercial] face recognition systems. We don’t have any standards on the technical side for what claims a company should make for these systems and how they should make them. For example, I think that all of these corporate systems, if they are going to sell these products, need to give well-calibrated probabilities for the output statements that they’re making. It looks like some of these systems don’t give you a confidence value, and the ones that do don’t tell you what it means. A probability is well calibrated if, when the system says [there’s] a 90% chance that two pictures are of the same person, it is right 90% of the time that it says so. That would be 90% over some [particular] population. Her [Buolamwini’s] experiments have shown that there are subpopulations where the number would have to be something like 60%, because the systems just don’t perform very well on some subpopulations.
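To make the calibration point concrete, here is a minimal sketch of the kind of audit being described, assuming you have a labeled evaluation set of image pairs. The function name, binning scheme, and variable names are illustrative assumptions, not any vendor’s actual interface.

```python
import numpy as np

def calibration_table(predicted_probs, is_same_person, n_bins=10):
    """Compare a face matcher's claimed match probabilities with its
    empirical accuracy, bin by bin, over a labeled evaluation set.

    predicted_probs : probabilities the system reported for
                      "these two photos show the same person"
    is_same_person  : 0/1 ground-truth labels for the same pairs
    """
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    is_same_person = np.asarray(is_same_person, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include the upper edge in the final bin so 1.0 is not dropped.
        upper = predicted_probs <= hi if hi >= 1.0 else predicted_probs < hi
        in_bin = (predicted_probs >= lo) & upper
        if in_bin.sum() == 0:
            continue
        rows.append({
            "bin": f"{lo:.1f}-{hi:.1f}",
            "mean_claimed": predicted_probs[in_bin].mean(),
            "empirical_accuracy": is_same_person[in_bin].mean(),
            "n_pairs": int(in_bin.sum()),
        })
    return rows

# A well-calibrated system shows mean_claimed ≈ empirical_accuracy in every
# bin; a Buolamwini-style audit repeats this check per demographic subgroup.
```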
The ACLU did an experiment like that, but it wasn’t clear how they set their thresholds. As I describe in the paper, you would expect that in a policing situation where you wanted to catch all of the bad guys, you would have to set the sensitivity of the detection very high and you would get a huge number of false alarms. Indeed, that is what we see in the South Wales policing numbers, and it is very consistent with my experience in fraud detection and cybersecurity. Here you have a similar problem, where the main challenge is the vast number of false alarms. My point is, even if you are a very knowledgeable user of these tools, and it seems that most police departments don’t qualify as that, even if you have a Ph.D. in machine learning, it’s not obvious how you should set the thresholds on these systems in practice, or what the best practices are for how they should be applied.
Then you have to do some kind of analysis of the harms that false alarms create. This is where you need to audit the human organization as well as the software. If you are a police department and you have turned it up to 99% and are getting a huge number of false alarms, it’s not just about how many people are getting pulled in due to the false alarms. You can imagine that there are people repeatedly false alarming on these systems, so it’s not a random tax spread evenly across the population. Some fraction of the time you’re randomly selected, but it may also always happen to you! Somehow you’d like to audit these organizations and say: what is your false alarm rate and what harms are resulting from that, how do you reduce those harms, what is an acceptable level, and who is bearing the burden?
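The scale of the false-alarm problem follows from base rates alone. The back-of-the-envelope sketch below uses made-up numbers chosen only to illustrate the effect; they are not the South Wales figures.

```python
# Hypothetical face-watchlist scenario: high sensitivity, large crowd,
# very few people actually on the watchlist.
true_positive_rate = 0.99      # sensitivity turned "up to 99%" to catch everyone
false_positive_rate = 0.01     # even a seemingly tiny per-face error rate
faces_scanned = 100_000        # faces scanned over some event or period
wanted_in_crowd = 20           # people actually on the watchlist in that crowd

true_alarms = true_positive_rate * wanted_in_crowd
false_alarms = false_positive_rate * (faces_scanned - wanted_in_crowd)
precision = true_alarms / (true_alarms + false_alarms)

print(f"expected true alarms:  {true_alarms:.0f}")
print(f"expected false alarms: {false_alarms:.0f}")      # ~1,000 innocent matches
print(f"share of alarms that are real: {precision:.1%}")  # roughly 2%
```

With numbers like these, almost every alarm is a false one, which is why the question of who repeatedly bears those false alarms matters so much.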

It seems that what you’re proposing would be simple black-box style testing that can be done without infringing on a vendor’s IP?

You would need to say over what database of images this was collected. It’s not appropriate to market it as a one-size-fits-all application; you really need to customize it for each application. I’m very skeptical that these things currently being marketed are going to work well in practice for their intended purpose, because they are not being customized on a per-application basis. I don’t know whether you need regulation, or whether we just need to educate the market. It would seem wise to me on the part of the vendors to get out in front of this issue and do it right.

Large organizations like Google are beginning to employ professional ethicists to investigate the potential risks of AI; do you have any advice for smaller teams without extensive resources to offset risk?

I feel like we don’t have the standardized tools for supporting a lot of this now. The R&D community needs to be building tools to support this.

Some tools exist, such as IBM’s “AI Fairness 360”, but a larger issue is whether someone is looking at the implications of these systems: what could possibly go wrong, and who it could affect. Those are areas where smaller teams with fewer resources might struggle to get access to that kind of broader thinking.

It needs to be part of our standard methodology. Maybe we need to develop some kind of conceptual analytical framework that people can go through, like a checklist, to think about the broader context of the system that they’re marketing. For instance, we can ask the simple question: ‘what happens with the repeated use of this system?’ The way we typically formulate our problems, we focus on single-step use. There is the case of YouTube video recommendations, which in machine learning we would formulate as a contextual multi-armed bandit problem.³ Over time, the system recommends progressively more and more extreme videos, so you can start out searching for halal dishes and end up watching Jihadist recruiting videos! It may not have occurred to the people building the system to look at the iterative effect, so that seems to be something we should be teaching everyone to think about.
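For readers unfamiliar with the formulation, below is a toy epsilon-greedy bandit loop (the context vector is omitted for brevity). It is a hypothetical sketch of the repeated-use dynamic, not YouTube’s actual recommender; the assumed engagement model, where slightly more extreme content gets slightly more clicks, exists only to show the drift.

```python
import random

# Arms are content "intensity" levels, 0 (mild) through 4 (extreme).
ARMS = [0, 1, 2, 3, 4]
CLICK_PROB = {a: 0.30 + 0.05 * a for a in ARMS}  # hypothetical engagement model
EPSILON = 0.1

counts = {a: 0 for a in ARMS}
value = {a: 0.0 for a in ARMS}   # running estimate of click-through per arm
history = []

for step in range(10_000):
    if random.random() < EPSILON:
        arm = random.choice(ARMS)                # explore
    else:
        arm = max(ARMS, key=lambda a: value[a])  # exploit best-looking arm
    reward = 1.0 if random.random() < CLICK_PROB[arm] else 0.0
    counts[arm] += 1
    value[arm] += (reward - value[arm]) / counts[arm]  # incremental mean
    history.append(arm)

# The loop gradually concentrates on the most extreme arm, because each step
# optimizes only the next click; nothing in the objective examines the
# cumulative trajectory of what a user is shown.
print("mean intensity, first 1000 steps:", sum(history[:1000]) / 1000)
print("mean intensity, last 1000 steps: ", sum(history[-1000:]) / 1000)
```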

In your paper, you mention the possibility of AI monitoring the human organization. Does AI currently have the ability to recognize problems in human institutions?

This is more of a research question. I’ve been involved in trying to survey who is working in this space right now, and I would say that, at the moment, there is little ability to monitor the functioning of teams in human organizations. It’s a case where it would be great to monitor for team breakdown, but this could easily be misused by management, so exactly how to do it is unclear to me. One scenario would be to use it to help with team training, or only during limited periods of time. David Woods at Ohio State works on how to make teams resilient, and he is doing studies of the dev-ops people who keep the big cloud-computing servers up under extreme conditions. I think we need more research in this area, and I’m advocating for the NSF and DoD to invest more in the study of how to do teamwork better. This is a natural next step.

Are we talking about real-time monitoring of human teamwork?

It’s currently more focused on the training situation. However, you can see where I’m going in terms of worrying about autonomous weapons systems, which is probably the highest risk application we might contemplate for AI. We would want to have tremendous focus on the functioning of teams, and you would want extreme monitoring of those teams because the system is only as reliable as its weakest link. That might not be the software, it might be the human organization.

[1]: I’m using machine-learning (ML) interchangeably with AI, though ML is only one technique used to realize AI.

[2]: Note that the Boeing 737 MAX crashes were likely due to the failure of a non-redundant angle-of-attack sensor, which fed erroneous data to the automated MCAS flight-control system.

[3]: See here for a description of the multi-armed bandit problem.


This article originally appeared in Towards Data Science.
