wiki:MatrixFactorization

Version 7 (modified by MarinkaZitnik, 3 years ago) (diff)

Matrix Factorization Techniques for Data Mining

Unified and efficient interface to matrix factorization algorithms and methods. A scripting library which includs a number of published factorization algorithms and initialization methods and facilitates the combination of these to produce new strategies.

Extensive documentation with working examples which demonstrate real applications, commonly used benchmark data and visualization methods are provided to help with the interpretation and comprehension of the results.

Overview

List of algorithms for matrix factorization.

nmf

Standard NMF based on Kullbach Leibler divergence, simple multiplicative updates, enhanced to avoid numerical overflow.
Reference: (Brunet, 2004).

s-nmf

Sparse NMF based on alternating non-negativity constrained least squares, solved by a fast non-negativity constrained least squares. Sparseness imposed on the left, right factor.
Reference: (Kim, 2007).

ns-nmf

Non-smooth NMF. Uses a modifed version of Lee and Seung's multiplicative updates for Kullbach-Leibler divergence. It is meant to give sparser results.
Reference: (Pascual-Montano, 2006).

l-nmf

Local Fisher NMF. Add the Fisher constraint to maximize the between-class scatter and minimize the within-class scatter.
Reference: (Wang, 2004).

ls-nmf

Alternating Least Square NMF. It is meant to be very fast compared to other approaches.
Reference: (Kim, 2007).

pmf

Probabilistic MF. Model which scales linearly with the number of observations and performs well on large, sparse, imbalanced datasets.
Reference: (Salakhutdinov, 2007).

psmf

Probabilistic Sparse MF. PSMF allows for varying levels of sensor noise, uncertainty in the hidden prototypes used to explain the data and uncertainty as to the prototypes selected to explain each data vector.
Reference: (Dueck, 2005).

bd

Bayesian decomposition. A Bayesian treatment of NMF, based on a normal likelihood and exponential priors, MCMC sampling method (Gibbs) to approximate the posterior density.
Reference: (Schmidt, 2003).

icm

Bayesian model - Iterated conditional modes (ICM) algorithm. The modes of conditional densities have closed form expressions and ICM algorithm has a low computational cost per iteration.
Reference: (Schmidt, 2003).

Common testing data sets:

  • Leukemia data set (AML/ALL)
  • Medulloblastoma data set
  • Central nervous system tumors

Timeline

Milestone: MF 0.05

April 20 – May 23 (Before the official coding time)

  • To do some self coding to improve my further understanding of techniques.
  • I will become absolutely clear about my future implementations, design and approaches I will follow.
  • Discuss design and interface of scripting library to: integrate well with Orange framework, assure scalability for future added factorizations or visualization support.

Milestone: MF 0.1

May 23 – June 18 (Official coding period starts)

  • Design and implement interface to perform all algorithms and combine them with initialization methods and extensions.
  • Implementing family of NMF techniques: NMF, nsNMF, lsNMF.

Milestone: MF 0.5

June 18 – July 5

  • Implementing family of NMF techniques: sNMF, lNMF, PMF.

July 5 – July 15

  • Extend NMF implementations with various extensions, additive and multiplicative update rules and initialization methods.
  • Provide factorization quality measures: measure based on cophenetic correlation coefficient, sparseness, dispersion, residuals.

Milestone: MF 1.0

July 15 – July 25

  • Implement Bayesian methods. Bayesian decomposition using Gibbs sampler (BD).
  • Handling PMF on large, sparse and unbalanced datasets (algorithm for probabilistic sparse matrix factorization (PSMF)).

July 25 – July 31

  • Implement Bayesian model - Iterated conditional modes algorithm (ICM).
  • Improve efficieny of the code, bug removal, exception handling, additional testing.

August 1 – August 15

  • For Documentation.
  • Devise working examples that demonstrate various types of applications.

August 15 – August 22

  • A buffer for unpredictable delay.

Milestone: MF Future 1.1

  • Extend Bayesian methods (variational BD, linearly constrained BD).
  • Adapt PMF model to the interval-valued matrices and implement Interval-valued PMF (I-PMF) and Interval-valued NMF (I-NMF) (Shen, 2010).
  • Nonnegative matrix approximation. Method for dimensionality reduction with respect on the nonnegativity of input data. Multiplicative iterative scheme (Sra, 2006).

References

  • Brunet, J. P., Tamayo, P., Golub, T. R., Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA, 2004, 101(12), 4164–4169.
  • Kim, H., Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics (Oxford, England), 2007, 23(12), 1495–502.
  • Pascual-Montano, A., Carazo, J. M., Kochi, K., Lehmann, D., Pascual-Marqui, R. D. Nonsmooth nonnegative matrix factorization (nsnmf). IEEE transactions on pattern analysis and machine intelligence, 2006 28(3), 403–415.
  • Wang, Y., Turk, M. Fisher non-negative matrix factorization for learning local features, 2004.
  • Salakhutdinov, R., Mnih, A. Probabilistic Matrix Factorization. Advances in Neural Information Processing Systems 20 (NIPS 2007).
  • Sra, S., Dhillon, I. S. Nonnegative Matrix Approximation : Algorithms and Applications. Sciences-New York, 2006, 1–36.
  • Zhiyong Shen, Liang Du, Xukun Shen, Yidong Shen, Interval-valued Matrix Factorization with Applications, Data Mining, IEEE International Conference on, pp. 1037–1042, 2010 IEEE International Conference on Data Mining, 2010.
  • Dueck, D., Morris, Q. D., Frey, B. J. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics (Oxford, England), 21 Suppl 1, 2005, 144–51.
  • Schmidt, M. N., Winther, O., Kai Hansen, L. Bayesian non-negative matrix factorization. Bayesian Statistics 7 (Oxford), 2003.
  • Ochs, M. F., Kossenkov A. V. NIH Public Access. Methods, Methods Enzymol., 2009, 59–77.