Andrew I. Schein, Ph.D.

Contact Information
LinkedIn Page
Google Scholar Page
Stack Overflow Profile


Notes

  • Andrew I. Schein. Notes on the CROC Curve. Unpublished. [.ps.gz] [.pdf]

  • Andrew Schein. Computation of log(\Phi(z)) For Large Negative z. In 2012, I discovered that the scipy routine scipy.stats.norm.logcdf(z) (the log CDF of the normal distribution) returned negative infinity for moderately negative values of z. I submitted a patch, and a lot of software now relies on the improvement. The linked PDF explains the mechanics of the patch and provides some supporting analysis. The associated issue tracker is here. [.pdf]
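The failure mode described in the note above can be illustrated with a minimal sketch. This is not the patch submitted to scipy. It assumes the standard asymptotic series for the lower tail of the normal distribution, and the function names naive_log_ndtr and asymptotic_log_ndtr are hypothetical, chosen only for this illustration.

```python
import math


def naive_log_ndtr(z: float) -> float:
    """log(Phi(z)) computed by taking the log of the CDF directly.

    For very negative z the CDF underflows to 0.0 in double precision,
    so the result collapses to negative infinity.
    """
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return math.log(p) if p > 0.0 else float("-inf")


def asymptotic_log_ndtr(z: float, terms: int = 10) -> float:
    """log(Phi(z)) from an asymptotic series, intended for z far below 0.

    Uses Phi(z) ~ phi(z) / (-z) * (1 - 1/z^2 + 3/z^4 - 15/z^6 + ...),
    and works in log space so that nothing underflows.
    """
    if z >= 0.0:
        raise ValueError("this sketch only covers negative z")
    inv_z2 = 1.0 / (z * z)
    series = 1.0
    term = 1.0
    for k in range(1, terms):
        # Successive terms pick up a factor of -(2k - 1) / z^2.
        term *= -(2 * k - 1) * inv_z2
        series += term
    log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    return log_phi - math.log(-z) + math.log(series)


if __name__ == "__main__":
    for z in (-10.0, -50.0):
        print(z, naive_log_ndtr(z), asymptotic_log_ndtr(z))
```

At z = -10 the two functions agree closely, while at z = -50 the naive version returns negative infinity and the series version stays finite. The series is only accurate for sufficiently negative z, so a production routine would switch between methods at some threshold.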

Invited Talks

  • Andrew I. Schein. Active Learning for Logistic Regression. Alberta Ingenuity Centre for Machine Learning, The University of Alberta. January 6, 2005. Talk Overheads [.ps].

  • Johnnie F. Caver. Novel Topic Impact on Author Attribution. Master's thesis in Computer Science. 2009. Outstanding thesis award. Co-advised with Craig Martell.
    Johnnie's thesis establishes a protocol for measuring author attribution degradation when predicting on novel topics (i.e., document topics not present in the training data). Her evaluation task may be reproduced by downloading the NYT corpus and filtering it using her lists and text extractor.

Below are some of the open-source software and patches I have written over the years. The earliest date back to around the year 2000. The ordering below is approximately chronological.

  • PennAspect (link lost to the ages). A Java implementation of the Aspect model, a belief network popular in many communities under various names. In the natural language processing and data mining worlds, "aspect model" and "probabilistic latent semantic indexing" are the prevalent names for this model. A third party has translated our code into C++ and incorporated it into the Lemur toolkit.

  • Logistic PCA A principal component analysis technique for binary data. Implements the model-fitting strategy introduced in my paper, A generalized linear model for principal component analysis of binary data. The code is implemented as a Matlab procedure.

  • PCLR An algorithm that predicts protein localization to the chloroplasts in plants. A web-based version of the algorithm described in Nucleic Acids Research, 2001, Vol 29, No. 16 e82. You can download the software that runs on the site.

  • pa_breakcont Camlp4 3.10 macro for adding break and continue to OCaml loops.

  • 3.10 pa_bounds Camlp4 3.10 compatible release of Martin Jambon's pa_bounds macro. The original is here.

  • Cython support in Exuberant Ctags (the package continues as Universal Ctags).

  • usort Optimized C99 routines for sorting numeric data. Much faster than glibc qsort() for this task.

  • NPSML Naval Postgraduate School Machine Learning Library.

  • sql.el bug fix for Postgres support in GNU Emacs, credited in the sql.el comments.

  • scipy.stats.norm.logcdf bug fix described above under "Notes".

  • cwd_jmp Bookmark relative paths in bash.

  • golang-gitoperations A Golang library for git automation.

  • AmIGitEnough A tool for onboarding team members into a rebase-oriented git workflow. I originally developed this inside Amazon, where it was very popular at least until I left in 2021. This package needs distribution maintainers for AppImage and Homebrew. Contact me if you are interested in helping out (or just go for it!).

    © Andrew Schein. All rights reserved.
    Web design copied with permission from Na-Rae Han.