Thursday, September 20, 2018

A game changer takes on cricket’s statistical problem

Jehangir Amjad has done something few people can: He found a way to combine his favorite sport with his work. A longtime cricket enthusiast and player, he’s currently tackling an important statistical problem in the game — how to declare a winner when a match must end prematurely, due to weather or other circumstances. Given cricket’s global popularity, and the fact that matches can last for several hours, it’s a problem of great interest to fans and players alike.

For Amjad, it’s also a project that incorporates his passion for operations research. And the Laboratory for Information and Decision Systems (LIDS) was the perfect place for him to explore it.

Amjad took a circuitous path to MIT. Born and raised in Pakistan, he received a scholarship to complete his last two years of high school at the Red Cross Nordic United World College in Norway. Along with the school’s 200 other students, who came from over 100 countries, he studied, made personal and professional connections, and learned how to live with people of many different cultures during his time there. He then returned home to teach for a year (following in the footsteps of his parents, who are both professors), before attending Princeton University for a bachelor's in electrical engineering.

He graduated in 2010 and, assuming he was finished with school, went to Microsoft to be a product manager. After several years there, though, he felt restless. Realizing that he’d found himself increasingly drawn to data science and machine learning since starting at Microsoft, he says he figured he could either stay in the tech industry and learn more about these fields on the job, or “go back to school to master the mathematical nuances of this field.” He chose academics and came to MIT in 2013 as a graduate student in the Operations Research Center. There, he collaborated frequently with LIDS students and researchers, under the supervision of LIDS Professor Devavrat Shah.

Because Shah is also a cricket fan, he and Amjad had been discussing the cricket problem for years, although Amjad didn’t land on his research project immediately. In fact, the theory that he is now applying to the cricket problem — robust synthetic control — is mostly used in economics, health policy, and political science. But because all of his work is interdisciplinary, he was able to see how to connect them. “A lot of what we train on [at LIDS] is the methods, but the applications are and should be very diverse,” Amjad says.

The current standard for international cricket games is to use the Duckworth-Lewis-Stern (DLS) method, created by British statisticians in the mid-1990s, to determine the winner when a game has to be called early. Amjad is viewing this as a forecasting problem.

“We aren’t just interested in predicting what the final score would be; we actually project out the entire trajectory for every ball, we project out what might happen on average,” he says.

Amjad has used the robust synthetic control method to propose a solution to the forecasting problem, which has also led to a target revision algorithm like the Duckworth-Lewis-Stern method. Having back-tested their cricket results on many games, he and his collaborators are confident in the approach. They are currently comparing it to DLS, he says, and planning “what statistical argument we can make so that we can hopefully convince people that we have a viable alternative.”

Broadly, synthetic control is a statistical method for evaluating the effects of an intervention. In many cases, the intervention is the introduction of a new law or regulation.

“Let’s say that 10 years ago, Massachusetts introduced a new labor law, and you wanted to study the impact of that law,” Amjad explains. “This theory says you can use a data-driven approach to come up with a synthetic Massachusetts, one that mimics Massachusetts as well as possible before the law was in place, so that you can then project what would have happened in Massachusetts had this law not been introduced.”

This creates a useful comparison point to the real Massachusetts, where the law has been in place. Placing the two side-by-side — the synthetic Massachusetts data and the real Massachusetts data — gives a sense of the law’s impact.

Amjad and his collaborators have developed a robust generalization of the classical method, which they call robust synthetic control. It turns out that, when a problem is examined this way, limited and missing data are not insurmountable obstacles. Instead, these sorts of difficulties can be accommodated, which is especially useful in the social sciences, where there may not be many common data points available.
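To give a rough sense of how missing data can be accommodated, here is a minimal Python sketch of one generic technique: fill the gaps with zeros, keep only the strongest singular components of the data matrix, and rescale by the fraction of entries that were observed. This is an illustrative assumption, not a description of the procedure Amjad and his collaborators use; the function name `denoise_low_rank`, the `rank` parameter, and the toy matrix are all hypothetical.

```python
import numpy as np

def denoise_low_rank(observed, rank):
    """Estimate a full matrix from one with gaps (hypothetical helper).

    observed: 2-D array (e.g. time periods x units) with np.nan marking
              missing entries.
    rank:     number of singular components to keep (an assumed tuning knob).
    """
    mask = ~np.isnan(observed)
    observed_fraction = mask.mean()          # share of entries actually seen
    filled = np.where(mask, observed, 0.0)   # put zeros where data is missing

    # Keep only the `rank` largest singular components.
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    s[rank:] = 0.0

    # Zero-filling shrinks the matrix by roughly the observed fraction,
    # so divide by it to undo that shrinkage.
    return (u * s) @ vt / observed_fraction

# Made-up toy data: a rank-1 table with two entries knocked out.
rows = np.arange(1.0, 7.0)[:, None]
cols = np.arange(1.0, 6.0)[None, :]
data = rows @ cols
data_missing = data.copy()
data_missing[0, 0] = np.nan
data_missing[3, 2] = np.nan

estimate = denoise_low_rank(data_missing, rank=1)
```

The estimate has no gaps, so it can be handed to any later step that needs a complete table. How many components to keep is a judgment call that this sketch leaves to the caller.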

Continuing his example, he says, “the method is about using data about other states … to construct a synthetic unit. So, specifically, coming up with a synthetic Massachusetts that ends up being 20 percent like New York, 10 percent Wyoming, 5 percent something else — coming up with a weighted average of those. And those weights are essentially what is known as the synthetic control because now you’ve fixed those weights and you’re going to project that out into the future to say, ‘This is what would have happened had the law not been introduced.’”
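The weighting idea Amjad describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not his implementation: it fits the weights with ordinary least squares, whereas classical synthetic control typically restricts them to be non-negative and sum to one. The function names, the three "donor" states, and all of the numbers are made up for the example.

```python
import numpy as np

def fit_synthetic_control(donors_pre, target_pre):
    """Find weights so that a weighted mix of donor units tracks the target
    unit before the intervention (hypothetical helper, plain least squares).

    donors_pre: array of shape (pre-periods, number of donor units)
    target_pre: array of shape (pre-periods,)
    """
    weights, *_ = np.linalg.lstsq(donors_pre, target_pre, rcond=None)
    return weights

def project_counterfactual(donors_post, weights):
    """Apply the fixed weights to the donors' post-intervention data."""
    return donors_post @ weights

# Made-up toy data: 20 periods before the law, 5 after, 3 donor states.
rng = np.random.default_rng(0)
n_pre, n_post = 20, 5
donors = rng.normal(loc=10.0, scale=1.0, size=(n_pre + n_post, 3))
true_weights = np.array([0.2, 0.1, 0.7])

target = donors @ true_weights   # the state under study, absent any law
target[n_pre:] += 2.0            # assumed effect of the hypothetical law

# Fit on the pre-law period only, then freeze the weights and project.
weights = fit_synthetic_control(donors[:n_pre], target[:n_pre])
synthetic_post = project_counterfactual(donors[n_pre:], weights)

# Gap between the real unit and its synthetic twin after the law.
estimated_effect = target[n_pre:] - synthetic_post
```

In this noise-free toy setup the fitted weights match the ones used to build the data, and the gap between the real and synthetic series equals the 2.0 units injected as the law's effect. Real data would be noisier, and the fit would only approximate the pre-law history.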

Eventually, as research continues and more data become available to add to the synthetic unit, the accuracy of the results should improve, he says.

Amjad has used robust synthetic control in this more traditional way, as well. One of his other projects has been a collaboration with a team at the University of Washington on a study of alcohol and marijuana use to assess whether various laws have, over time, affected their sale and use. Another example he mentions as being a particularly good fit is any situation where a randomized control trial isn’t possible, such as studying the effect of distributing international aid in a crisis. Here, the moral and ethical implications of denying certain people aid make it impossible to use a randomized trial. Instead, observational studies are in order.

“You [the researcher] can’t control who gets the treatment and who doesn’t,” he says, but the results of it can be watched, recorded, and studied. As his work evolves, he’s also looking towards the future, thinking about time series forecasting and imputation.

“My work has converged on imputation and forecasting methods, whether it’s synthetic control or just pure time-series analysis,” he says.

This intersection is an emerging field of study. Econometricians historically used small data sets and classical statistics for problem solving, but with modern machine learning, options now exist that use lots of data to do approximate inference instead. Combining these approaches means you can explore the why of the problem and the prediction.

“You care both about the explanatory power and the predictive power, using these algorithms,” Amjad says. “These are designed for a larger scale, where you can still be prescriptive as well as predictive.” Election forecasting is just one important example of the areas in which this work could be put to use.

Having defended his thesis earlier this year, Amjad is now a lecturer of machine learning at MIT’s Computer Science and Artificial Intelligence Laboratory. He says he is grateful for his time at LIDS — and all of the inspirational individuals he’s met and the groundbreaking ideas he’s come across here.

“The biggest lesson of my PhD is that it’s a journey,” he says. “LIDS is very accepting of you breaking the norm. They let people wander. And what that really helps you with is to understand that you can deal with ambiguity. If there is a problem that I don’t know about, I may never be able to completely solve it, but that won’t prevent me from thinking about it in a systematic way to hope to solve some parts of it.”



From MIT News https://ift.tt/2xslrRL
