wiki:MatrixFactorization

Matrix Factorization Techniques for Data Mining

Unified and efficient interface to matrix factorization algorithms and methods. A scripting library which includs a number of published factorization algorithms and initialization methods and facilitates the combination of these to produce new strategies.

Extensive documentation with working examples which demonstrate real applications, commonly used benchmark data and visualization methods are provided to help with the interpretation and comprehension of the results.

Overview

List of algorithms for matrix factorization.

Additional: snmnmf

Sparse network regularized multiple NMF.
Reference: (Zhang, 2011)

Additional: bmf

Binary NMF. Extend standard NMF to BMF; given a binary matrix X, factorize X into two binary matrices W, H. Two versions are possible: penalty function algorithm and thresholding algorithm. Example usages are document clustering, identification of bicluster structures in gene expression datasets, etc.
Reference: (Zhang, 2007).

nmf

Standard NMF based on Kullbach Leibler divergence, simple multiplicative updates, enhanced to avoid numerical overflow.
Reference: (Brunet, 2004).

s-nmf (s-nmf/r and s-nmf/l)

Sparse NMF based on alternating non-negativity constrained least squares, solved by a fast non-negativity constrained least squares. Sparseness imposed on the left, right factor. It is meant to be very fast compared to other approaches.
Reference: (Kim, 2007).

ns-nmf

Non-smooth NMF. Uses a modified version of Lee and Seung's multiplicative updates for Kullbach-Leibler divergence. It is meant to give sparser results.
Reference: (Pascual-Montano, 2006).

l-fnmf

Local Fisher NMF. Add the Fisher constraint to maximize the between-class scatter and minimize the within-class scatter.
Reference: (Wang, 2004).

ls-nmf

Alternating non-negative least squares MF using the projected gradient method for each subproblem.
Reference: (Lin, 2007).

pmf

Probabilistic MF. Target matrix is interpreted as samples from multinomial.
Reference: (Hansen, 2008), (Laurberg, 2008).

psmf

Probabilistic Sparse MF. PSMF allows for varying levels of sensor noise, uncertainty in the hidden prototypes used to explain the data and uncertainty as to the prototypes selected to explain each data vector.
Reference: (Dueck, 2005).

bd

Bayesian decomposition. A Bayesian treatment of NMF, based on a normal likelihood and exponential priors, MCMC sampling method (Gibbs) to approximate the posterior density.
Reference: (Schmidt, 2009).

icm

Bayesian model - Iterated conditional modes (ICM) algorithm. The modes of conditional densities have closed form expressions and ICM algorithm has a low computational cost per iteration.
Reference: (Schmidt, 2009).

List of algorithms for matrix factorization initialization (currently).

  • random
  • fixed
  • nndsvd (Boutsidis, 2007)
  • random C (Albright, 2006)
  • random Vcol (Albright, 2006)
  • pmf
  • algorithm specific

Common testing data sets:

  • synthetic data sets (ensure correctness of implemented methods)
  • Leukemia data set (AML/ALL)
  • Medulloblastoma data set
  • Central nervous system tumors
  • etc.

Timeline

Milestone: MF 0.05

April 20 – May 23 (Before the official coding time)

  • To do some self coding to improve my further understanding of techniques.
  • I will become absolutely clear about my future implementations, design and approaches I will follow.
  • Discuss design and interface of scripting library to: integrate well with Orange framework, assure scalability for future added factorizations or visualization support.

Milestone: MF 0.1

May 23 – June 18 (Official coding period starts)

  • Design and implement interface to perform all algorithms and combine them with initialization methods and extensions.
  • Implementing family of NMF techniques: NMF, nsNMF, lsNMF.

Milestone: MF 0.5

June 18 – July 5

  • Implementing family of NMF techniques: sNMF, lfNMF, PMF.

July 5 – July 15

  • Extend NMF implementations with various extensions, additive and multiplicative update rules and initialization methods.
  • Provide factorization quality measures: measure based on cophenetic correlation coefficient, sparseness, dispersion, residuals.

Milestone: MF 1.0

July 15 – July 25

  • Implement Bayesian methods. Bayesian decomposition using Gibbs sampler (BD).
  • Handling PMF on large, sparse and unbalanced datasets (algorithm for probabilistic sparse matrix factorization (PSMF)).

July 25 – July 31

  • Implement Bayesian model - Iterated conditional modes algorithm (ICM).
  • Improve efficieny of the code, bug removal, exception handling, additional testing.

August 1 – August 15

  • For Documentation.
  • Devise working examples that demonstrate various types of applications.

August 15 – August 22

  • A buffer for unpredictable delay.

Milestone: MF Future 1.1

  • Extend Bayesian methods (variational BD, linearly constrained BD).
  • Adapt PMF model to the interval-valued matrices and implement Interval-valued PMF (I-PMF) and Interval-valued NMF (I-NMF) (Shen, 2010).
  • Nonnegative matrix approximation. Method for dimensionality reduction with respect on the nonnegativity of input data. Multiplicative iterative scheme (Sra, 2006).

Source

Source code is published at  GitHub repository.

ssh | http:  git@github.com:marinkaz/MF.git |  https://marinkaz@github.com/marinkaz/MF.git

Documentation

Built documentation is available at  MF site. Documentation sources are avaiable at  GitHub repository.

References

  • Brunet, J. P., Tamayo, P., Golub, T. R., Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA, 2004, 101(12), 4164–4169.
  • Lin, C.J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10), 2756-79. doi: 10.1162/neco.2007.19.10.2756.
  • Kim, H., Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics (Oxford, England), 2007, 23(12), 1495–502.
  • Pascual-Montano, A., Carazo, J. M., Kochi, K., Lehmann, D., Pascual-Marqui, R. D. Nonsmooth nonnegative matrix factorization (nsnmf). IEEE transactions on pattern analysis and machine intelligence, 2006 28(3), 403–415.
  • Wang, Y., Turk, M. Fisher non-negative matrix factorization for learning local features, 2004.
  • Salakhutdinov, R., Mnih, A. Probabilistic Matrix Factorization. NIPS 2008.
  • Sra, S., Dhillon, I. S. Nonnegative Matrix Approximation : Algorithms and Applications. Sciences-New York, 2006, 1–36.
  • Zhiyong Shen, Liang Du, Xukun Shen, Yidong Shen, Interval-valued Matrix Factorization with Applications, Data Mining, IEEE International Conference on, pp. 1037–1042, 2010 IEEE International Conference on Data Mining, 2010.
  • Dueck, D., Morris, Q. D., Frey, B. J. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics (Oxford, England), 21 Suppl 1, 2005, 144–51.
  • Schmidt, M.N., Winther, O., and Hansen, L.K., (2009). Bayesian Non-negative Matrix Factorization. In Proceedings of ICA. 2009, 540-547.
  • Ochs, M. F., Kossenkov A. V. NIH Public Access. Methods, Methods Enzymol., 2009, 59–77.
  • Zhang, Z., Li, T., Ding, C. H. Q., Zhang, X. Binary Matrix Factorization with Applications. ICDM 2007.
  • Salakhutdinov, R., Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. ICML 2008, 880-887.
  • Laurberg, H.,et. al. Theorems on positive data: on the uniqueness of NMF. Computational intelligence and neuroscience. 2008.
  • Hansen, L. K. Generalization in high-dimensional factor models. 2008. Web:  http://www.stanford.edu/group/mmds/slides2008/hansen.pdf.
  • Albright, R. et al., (2006). Algorithms, initializations, and convergence for the nonnegative matrix factorization. Matrix, (919), p.1-18.
  • Boutsidis, C., Gallopoulos, E., (2007). SVD-based initialization: A head start for nonnegative matrix factorization, Pattern Recognition, 2007.
  • Zhang, S. et. al., (2011). A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics 2011, 27(13), i401-i409. doi:10.1093/bioinformatics/btr206.