Andrew I. Schein, Ph.D.


Contact Information
Publications

Notes

  • Andrew I. Schein. Notes on the CROC Curve. Unpublished. [.ps.gz] [.pdf]

Posters

Invited Talks

Advising

  • Johnnie F. Caver. Novel Topic Impact on Author Attribution. Master's thesis in Computer Science. 2009. Outstanding thesis award. Co-advised with Craig Martell.
    Johnnie's thesis establishes a protocol for measuring how author attribution degrades when predicting on novel topics (i.e., document topics not present in the training data). Her evaluation task may be reproduced by downloading the NYT corpus and filtering it with her lists and text extractor.

Software

Below are some of the software packages I have written and released under relatively loose licensing terms, in some cases with no restrictions at all. If you come across alternative implementations of these tools, please send me links and I will add them to these pages. I regret that I cannot provide step-by-step instructions tailored to your particular needs.

  • PennAspect A Java implementation of the aspect model, a belief network that has appeared in many communities under various names. In the natural language processing and data mining worlds, it is usually called the "aspect model" or "probabilistic latent semantic indexing". A third party has translated our code into C++ and incorporated it into the Lemur toolkit.

  • ROCtools includes ROC curves and the CROC curve variant for recommender system evaluation, written in Java. There are many ROC curve implementations out there; what sets this one apart is that it can handle very large datasets. It lacks many of the common add-ons such as error bars and curve smoothing.

  • Logistic PCA A principal component analysis technique for binary data. Implements the model-fitting strategy introduced in my paper "A generalized linear model for principal component analysis of binary data". The code is implemented as a MATLAB procedure.

  • PCLR An algorithm that predicts protein localization to the chloroplasts in plants. A web-based version of the algorithm described in Nucleic Acids Research, 2001, Vol. 29, No. 16, e82. You can download the software that runs on the site.

  • pa_breakcont Camlp4 3.10 macro for adding break and continue to OCaml loops.

  • 3.10 pa_bounds Camlp4 3.10-compatible release of Martin Jambon's pa_bounds macro. The original is here.

  • usort Optimized C99 routines for sorting numeric data. Much faster than glibc qsort() for this task.

  • NPSML Naval Postgraduate School Machine Learning Library.


    © 2005-2008 Andrew Schein. All rights reserved.
    Web design copied with permission from Na-Rae Han.