Andrew I. Schein, Ph.D.

Contact Information


  • Andrew I. Schein. Notes on the CROC Curve. Unpublished. [.ps.gz] [.pdf]


Invited Talks


  • Johnnie F. Caver. Novel Topic Impact on Author Attribution. Master's thesis in Computer Science. 2009. Outstanding thesis award. Co-advised with Craig Martell.
    Johnnie's thesis establishes a protocol for measuring how author attribution degrades when predicting on novel topics (i.e., document topics not present in the training data). Her evaluation task may be reproduced by downloading the NYT corpus and filtering it using her lists and text extractor.

Below are some of the software packages I have written that are released under various licenses.

  • PennAspect A Java implementation of the aspect model, a belief network that has appeared in many communities under various names. In the natural language processing and data mining worlds, "aspect model" and "probabilistic latent semantic indexing" are the prevalent names for this model. A third party has translated our code into C++ and incorporated it into the Lemur toolkit.

  • ROCtools A Java implementation of ROC curves and the CROC curve variant for recommender system evaluation. There are many ROC curve implementations out there. What makes this one different is that it can handle very large datasets. It lacks many of the common add-ons such as error bars and curve smoothing.

  • Logistic PCA A principal component analysis technique for binary data. Implements the model-fitting strategy introduced in my paper, "A generalized linear model for principal component analysis of binary data." The code is implemented as a MATLAB procedure.

  • PCLR An algorithm that predicts protein localization to the chloroplasts in plants. A web-based version of the algorithm described in Nucleic Acids Research, 2001, Vol 29, No. 16 e82. You can download the software that runs on the site.

  • pa_breakcont Camlp4 3.10 macro for adding break and continue to OCaml loops.

  • 3.10 pa_bounds A Camlp4 3.10-compatible release of Martin Jambon's pa_bounds macro. The original is here.

  • usort Optimized C99 routines for sorting numeric data. Much faster than glibc qsort() for this task.

  • NPSML Naval Postgraduate School Machine Learning Library.

  • cwd_jmp Bookmark relative paths in bash.

  • golang-gitoperations A Golang library for git automation.
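
For readers unfamiliar with the ROC curves mentioned under ROCtools above, the following is a minimal Python sketch of how ROC points and the area under the curve can be computed from scored, labeled examples. It is an illustration only, not the ROCtools code (which is written in Java), and it makes no attempt at large-dataset handling or at the CROC variant. The function names roc_points and auc are hypothetical and chosen for this example.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    One point is emitted per distinct score threshold; labels are 1 for
    positive examples and 0 for negative examples.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
    print(pts)
    print(auc(pts))
```

Sorting once and sweeping the threshold keeps the cost at O(n log n) for n examples; tied scores are grouped so that the curve does not depend on the order of ties.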

    © Andrew Schein. All rights reserved.
    Web design copied with permission from Na-Rae Han.