Speaker

China Spark Technology Summit

梁堰波

Senior practitioner in big data and machine learning

I'm a software engineer with a strong background in computer science and mathematics. I'm interested in implementing effective statistical and machine learning algorithms on scalable distributed systems such as Apache Spark. I'm an active Apache Spark contributor, working mostly on the Spark ML/MLlib and SparkR projects. Previously, I was a software engineer at France Telecom, Meituan, and Yahoo!, in that order, working on machine learning and distributed systems.

Talk Topic

Build generalized linear models on massive dataset

Generalized linear models (GLMs) extend the traditional linear model to a wider range of statistical modeling problems by letting the modeler specify a model family and a link function. GLMs have gained popularity as a statistical modeling tool because they are flexible enough to address a variety of statistical problems and because software to fit them is widely available. However, leveraging rich, validated statistical software such as R is a challenge given the massive dataset sizes in Hadoop. In this talk, we will discuss how Spark MLlib fits common GLMs on large-scale datasets using Iteratively Reweighted Least Squares (IRLS) and Limited-memory BFGS (L-BFGS), the pros and cons of each solver for training datasets of different sizes, and the implementation details needed to match the model output, summary statistics, and predictions of R's glm and glmnet. We will also demonstrate the APIs in MLlib and SparkR. This is joint work with other Spark community members.
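To make the IRLS idea concrete, here is a minimal single-machine sketch in plain Python. It is not Spark MLlib's implementation. It assumes a binomial family with the canonical logit link, and the helper names (`solve_linear_system`, `fit_logistic_irls`) and the toy data are hypothetical, chosen only for illustration. Each iteration computes working weights and a working response from the current coefficients, then solves a weighted least-squares problem.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations apply to both sides.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_logistic_irls(rows, labels, max_iter=25, tol=1e-10):
    """Fit a binomial GLM with logit link by IRLS (hypothetical helper)."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        xtwx = [[0.0] * p for _ in range(p)]  # accumulates X^T W X
        xtwz = [0.0] * p                      # accumulates X^T W z
        for x, y in zip(rows, labels):
            eta = sum(b * v for b, v in zip(beta, x))  # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))          # inverse logit link
            w = mu * (1.0 - mu)                        # working weight
            z = eta + (y - mu) / w                     # working response
            for i in range(p):
                xtwz[i] += w * x[i] * z
                for j in range(p):
                    xtwx[i][j] += w * x[i] * x[j]
        new_beta = solve_linear_system(xtwx, xtwz)
        change = max(abs(n - o) for n, o in zip(new_beta, beta))
        beta = new_beta
        if change < tol:
            break
    return beta


# Toy data (an assumption): intercept column plus one feature,
# with labels that are not perfectly separable.
rows = [[1.0, float(v)] for v in range(6)]
labels = [0, 0, 1, 0, 1, 1]
coefficients = fit_logistic_irls(rows, labels)
print(coefficients)
```

The sketch shows one trade-off between the two solvers. Each IRLS step only needs the small p-by-p matrix X^T W X and the p-vector X^T W z, both of which are sums over rows and so can be accumulated in a distributed pass, but solving the resulting system becomes expensive as the number of features p grows. A quasi-Newton method such as L-BFGS avoids forming that matrix.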