From the Frontline of the Research on Machine Learning in the US

May 9, 2019



-Originator of Stochastic Gradient Descent Method-

Dr. Leon Bottou
Research Lead
Facebook AI Research


Dr. Leon Bottou is one of the leading AI researchers who proved the effectiveness of the "Stochastic Gradient Descent Method (SGD)" in deep learning when he was a researcher at NEC Laboratories America. SGD is the main optimization method for deep learning because of its computational cost advantage and its surprising robustness. SGD can also learn online, in real time, even when one cannot load or obtain the full training set at once. In this interview he talks about his passion for analyzing the problems people face from a unique perspective, as well as his current interests in AI research.

Dr. Leon Bottou

One of the research leads of Facebook AI Research, the AI research division of Facebook. He is an AI researcher globally known in the fields of machine learning and data compression, and was in the machine learning department of NEC Laboratories America (NECLA) from 2002 to 2010. At NECLA, he mainly conducted research in machine learning using large-scale data sets, and proved the effectiveness of the "Stochastic Gradient Descent Method (SGD)" in deep learning, which has greatly influenced current AI learning algorithms. Also, his paper "The Tradeoffs of Large Scale Learning", published in 2008, won the Test of Time Award (honoring the most influential paper of the past 10 years) at NeurIPS, one of the top machine learning conferences, in 2018. Currently, he is pursuing his interests at the forefront of AI research at Facebook New York.

His career began in the Adaptive Systems Research Department at Bell Laboratories, where he conducted research on local learning algorithms, and then he founded Neuristique in France, where he developed tools for machine learning and data mining. After returning to Bell Laboratories, he developed new machine learning methods and contributed to the development of handwriting recognition and OCR technology. In addition, he worked on an image compression technology known as DjVu at AT&T Labs, then joined NECLA as a researcher and Microsoft as a partner scientist.
Since 2015, he has been working at Facebook AI Research.

He received degrees in engineering, mathematics, and a Ph.D. in computer science from École Polytechnique, École Normale Supérieure, and Université Paris-Sud. He is an associate editor of the IEEE's Transactions on Pattern Analysis and Machine Intelligence and the IAPR's Pattern Recognition Letters, and author or co-author of over 100 peer-reviewed papers.

My AI career started with a stroke of luck.

I became interested in machine learning while attending École Polytechnique in my home country, France. I had read a paper discussing neural networks and tried one on the school's computer. It worked! I was hooked! When I told my mentor that I wanted to obtain a doctoral degree in neural networks, he replied, "How do you get a doctoral degree in a discipline that does not exist?" He was right, but I was young and stubborn, so I continued. In retrospect, I think I was lucky.
In the world of physics, it is easy to model various phenomena and entities, and simple rules can represent their nature. Analysis is then possible by so-called scientific methods. However, in fields such as biology and the social sciences, the targets are complex, and modeling is difficult. Therefore, we cannot handle them in a simplified way, nor can we apply a simple set of rules.
AI is similar to the latter. That is why we have to deal with it using approximate simulations. In machine learning with neural networks, the inferences based on the learning results are compared with the correct values, and the parameters of each neuron are adjusted so that the error is minimized. However, the problem here is the trade-off between computational cost and accuracy. In general, the more data samples you use to calculate errors, the more accurate the learning result, but at a cost in computer time.
So, I thought about what would be the most efficient way to minimize the error assuming that training data is unlimited but computer time is bounded. In this problem, there are several options, such as processing all the data at once or dealing with small amounts of data in quick succession. The steepest descent method, which sums up the errors over all the data before updating the parameters, is more accurate, but at the price of an enormous computational cost. However, I proved that it is better to use SGD, in which the parameters are adjusted on the basis of randomly chosen training examples.
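The contrast above can be sketched in a few lines of code. This is a minimal illustration (not Dr. Bottou's actual work), fitting a one-parameter linear model to toy data: the steepest descent step averages the gradient over the whole data set before one update, while the SGD step adjusts the parameter from a single randomly drawn example.

```python
import random

# Toy data: noisy samples of y = 3*x, for x in (0, 1].
random.seed(0)
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in [i / 100 for i in range(1, 101)]]

def full_batch_step(w, lr):
    # Steepest descent: sum the error gradient over ALL samples, then update once.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def sgd_step(w, lr):
    # SGD: update from one randomly chosen training example.
    x, y = random.choice(data)
    return w - lr * 2 * (w * x - y) * x

w_batch = w_sgd = 0.0
for _ in range(200):
    w_batch = full_batch_step(w_batch, 0.05)  # one pass over all 100 samples per step
    w_sgd = sgd_step(w_sgd, 0.05)             # one sample per step: 100x cheaper here

print(w_batch, w_sgd)  # both approach the true slope of 3
```

Both estimates converge toward the true parameter, but each SGD step touches only one example, which is why the method scales so well when data is abundant and computer time is the binding constraint.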
My paper on SGD, called "The Tradeoffs of Large Scale Learning", was well received by the machine learning community because, as we deal with more and more data, striking the right balance between computational cost and accuracy becomes critical. Now, relatively cheap GPU hardware can perform such computation very quickly. However, the total amount of electrical power used for this is now so large that I am getting concerned.

To Get the Right Answer, You Need to Ask the Right Question

By the way, if you think about how the human brain functions, you can easily imagine that it does not use the SGD algorithm. SGD often relies on the "back propagation algorithm", which computes the contribution of each neuron to the error by moving backwards through the network's layers. It is difficult to believe that back propagation really occurs in the brain.
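To make the backward pass concrete, here is a minimal sketch (illustrative only, not code from the interview) of back propagation on a tiny network with one input, two tanh hidden units, and a linear output: the error appears at the output and is sent backwards so that each weight is adjusted according to its computed contribution.

```python
import math
import random

random.seed(1)
# A tiny 1 -> 2 -> 1 network.
w1 = [random.uniform(-1, 1) for _ in range(2)]  # input-to-hidden weights
w2 = [random.uniform(-1, 1) for _ in range(2)]  # hidden-to-output weights

def forward(x):
    h = [math.tanh(w * x) for w in w1]           # hidden activations
    y = sum(wo * hi for wo, hi in zip(w2, h))    # linear output
    return h, y

def backprop_step(x, target, lr=0.1):
    h, y = forward(x)
    dy = y - target                              # error signal at the output layer
    for i in range(2):
        dh = dy * w2[i]                          # error propagated back to hidden unit i
        w2[i] -= lr * dy * h[i]                  # adjust by this weight's contribution
        w1[i] -= lr * dh * (1 - h[i] ** 2) * x   # tanh'(z) = 1 - tanh(z)^2

# Train with SGD on a toy task: approximate y = sin(x) on [-1, 1].
samples = [(i / 10, math.sin(i / 10)) for i in range(-10, 11)]
for _ in range(5000):
    x, t = random.choice(samples)
    backprop_step(x, t)
```

Each update requires running the error backwards through every layer, a global bookkeeping step that is hard to map onto biological neurons, which is exactly the point made above.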
Artificial neural networks are said to mimic the workings of the brain, but the actual brain seems to work with simpler mechanisms on a much larger scale. The number of neurons in our brain is incomparably higher than the number of artificial neurons in current neural networks, and they function on a very small amount of sugar.
Currently, the most advanced image recognition systems are trained with 3 billion images. This is comparable to the number of scenes that humans see in their lifetime. Also, in the case of language modeling systems, the amount of text used in the learning process can be more than one could read in ten lifetimes. Paradoxically, this gives us an idea of how efficiently our brain acquires the cognitive skills needed for everyday life. What kind of mechanism makes that possible has not been elucidated yet.
In any case, no matter what the actual brain mechanism is, if you try to realize AI using current computers, you might settle for neural networks. It is realistic to think about how to improve the efficiency on that basis.
From that point of view, my emphasis is on asking the "right questions". I want to understand the complex issues people are facing, but there are still a lot of things in this world I do not know. And I do not know how to ask questions about things I do not know. So we can start by looking at applications dealing with real problems, and at the results of applying machine learning techniques. Is the problem actually solved? And if not, why?
In many cases, I think that a lot of issues arise from our inability to model the chain of causes and effects. By making or learning better causal models, you get semantically meaningful answers. But the first step remains asking the right question.

The Problem of Bias and Inefficiency in Deep Learning

Let's talk a little bit about my recent interests. For example, how do you recognize a scene where a cat is walking? If there is no formal specification for recognizing this scene, there are two ways of thinking about it.
One is the heuristic approach, that is, building a specification by trial and error and judging against it. This is the classic way. The other one uses a proxy program to create an approximate specification. This is the machine learning way.
A couple of years ago, a colleague and I worked on a system to recognize whether someone in a picture is making a phone call. Although the recognition rates on test data were satisfactory, we found that the machine decided that "a person is making a phone call" whenever the picture showed a phone close to that person.
As you know, just having a phone on your desk or in your hand does not mean you are actually making a phone call. However, the system claimed that all such persons were making phone calls.
Surprisingly, this result is statistically correct: if you search the Internet for photos of a person near a phone, it is very likely that the person is making a phone call. This is why we had good recognition rates for a system that had essentially missed the point of what it means to make a phone call.
This is due to the bias of the learning samples from the real world, where photos of a person holding a phone and making a call are overwhelmingly dominant. They are dominant probably because that situation is more photogenic.
From this, I believe that we need a way to reduce the dependence on the real statistical distribution, to avoid creating systems that unwisely exploit superficial correlations.
It is also true that deep learning will reach its limits because it currently needs too much data. If one needs more text than a human can read in many lifetimes to train a language system, something is already wrong.
Well, I think that finding the idea that comes after deep learning is the biggest problem in AI. That is why I am working on this problem. I hope I will find it, but somebody else may find it first.

Supportive NECLA community and suitable environment for research

Regarding NECLA, I would like to praise the fact that NEC did not stop supporting R&D even during its financially difficult period. That was a brave decision.
Princeton, where NECLA is located, is, in a sense, a peaceful island. There are also a canal and a pond, both suitable for taking a walk. I think Japanese people can well understand that such a quiet environment is necessary to organize your ideas and to identify the top priority.
We created a small group to study machine learning and conducted research with talented researchers such as Jason Weston, Ronan Collobert, Vladimir Vapnik, and Yann LeCun. The comfortable work environment of NECLA allowed those motivated researchers to focus very seriously on research, so that we could accomplish things as a team that individuals could not achieve.
I left NECLA for several reasons: I had a strong desire to eventually return to France; several of my fellow researchers left for other places or retired; and the department headed toward more applied research. I think the shift to applied research is understandable from a business point of view, but my motivation for research has always been the desire to learn something, so I wanted to pursue both fundamental and applied research. By the way, my return to France has been postponed by getting married ....
And now, AI research is very fashionable. Maybe too fashionable. I think this has happened because computers have already transformed all the fields that could be transformed using classic software. Something new is needed to find further growth. That is why machine learning and AI appeal to many companies.
Also, regarding intellectual property rights: for example, is mathematics owned by anyone? No. And many people think software algorithms should not be owned by anyone either.
Some companies are tempted to secure IP on AI research because it could not have been realized without corporate resources. However, past experience shows that this attitude tends to slow down progress and sometimes kills the research. Therefore, I am glad when the company I work for takes a very open approach to research.

Following the passion is the researcher's belief

My advice to students and researchers interested in AI is to "follow your own passion". The road of research has ups and downs, so we need passion to continue. In addition, while pursuing your own research theme, it is sometimes important to look around, because there is something to learn everywhere.
I have never thought about money. It is not a good idea to do research for rewards. One of my uncles used to be a surgeon in Casablanca, and he told me that he just did his best in treating his patients; money then took care of itself. In developed countries, a student who studies computer science and machine learning has a guaranteed future. If you do not feel anxious about your life, you should be able to follow your passion without any hesitation.

(interview and text / Kazutoshi Otani)
