Version 13 (modified by MarinkaZitnik, 3 years ago) (diff) |
---|

# Matrix Factorization Techniques for Data Mining

Unified and efficient interface to matrix factorization algorithms and methods. A scripting library which includs a number of published factorization algorithms and initialization methods and facilitates the combination of these to produce new strategies.

Extensive documentation with working examples which demonstrate real applications, commonly used benchmark data and visualization methods are provided to help with the interpretation and comprehension of the results.

## Overview

List of algorithms for matrix factorization.

### Additional: bnmf

Binary NMF. Extend standard NMF to BMF; given a binary matrix X, factorize X into two binary matrices W, H. Two versions are possible: penalty function algorithm and thresholding algorithm. Example usages are document clustering, identification of bicluster structures in gene expression datasets, etc.

Reference: (Zhang, 2007).

### nmf

Standard NMF based on Kullbach Leibler divergence, simple multiplicative updates, enhanced to avoid numerical overflow.

Reference: (Brunet, 2004).

### s-nmf

Sparse NMF based on alternating non-negativity constrained least squares, solved by a fast non-negativity constrained least squares. Sparseness imposed on the left, right factor. It is meant to be very fast compared to other approaches.

Reference: (Kim, 2007).

### ns-nmf

Non-smooth NMF. Uses a modifed version of Lee and Seung's multiplicative updates for Kullbach-Leibler divergence. It is meant to give sparser results.

Reference: (Pascual-Montano, 2006).

### l-nmf

Local Fisher NMF. Add the Fisher constraint to maximize the between-class scatter and minimize the within-class scatter.

Reference: (Wang, 2004).

### ls-nmf

Alternating non-negative least squares MF using the projected gradient method for each subproblem.

Reference: (Lin, 2007).

### pmf

Probabilistic MF. Model which scales linearly with the number of observations and performs well on large, sparse, imbalanced datasets.

Reference: (Salakhutdinov, 2007).

### psmf

Probabilistic Sparse MF. PSMF allows for varying levels of sensor noise, uncertainty in the hidden prototypes used to explain the data and uncertainty as to the prototypes selected to explain each data vector.

Reference: (Dueck, 2005).

### bd

Bayesian decomposition. A Bayesian treatment of NMF, based on a normal likelihood and exponential priors, MCMC sampling method (Gibbs) to approximate the posterior density.

Reference: (Schmidt, 2003).

### icm

Bayesian model - Iterated conditional modes (ICM) algorithm. The modes of conditional densities have closed form expressions and ICM algorithm has a low computational cost per iteration.

Reference: (Schmidt, 2003).

## List of algorithms for matrix factorization initialization (currently).

- random
- fixed
- nndsvd
- algorithm specific

## Common testing data sets:

- synthetic data sets (ensure correctness of implemented methods)
- Leukemia data set (AML/ALL)
- Medulloblastoma data set
- Central nervous system tumors
- etc.

## Timeline

### Milestone: MF 0.05

**April 20 – May 23 (Before the official coding time)**

- To do some self coding to improve my further understanding of techniques.
- I will become absolutely clear about my future implementations, design and approaches I will follow.
- Discuss design and interface of scripting library to: integrate well with Orange framework, assure scalability for future added factorizations or visualization support.

### Milestone: MF 0.1

**May 23 – June 18 (Official coding period starts)**

- Design and implement interface to perform all algorithms and combine them with initialization methods and extensions.
- Implementing family of NMF techniques: NMF, nsNMF, lsNMF.

### Milestone: MF 0.5

**June 18 – July 5**

- Implementing family of NMF techniques: sNMF, lNMF, PMF.

**July 5 – July 15**

- Extend NMF implementations with various extensions, additive and multiplicative update rules and initialization methods.
- Provide factorization quality measures: measure based on cophenetic correlation coefficient, sparseness, dispersion, residuals.

### Milestone: MF 1.0

**July 15 – July 25**

- Implement Bayesian methods. Bayesian decomposition using Gibbs sampler (BD).
- Handling PMF on large, sparse and unbalanced datasets (algorithm for probabilistic sparse matrix factorization (PSMF)).

**July 25 – July 31**

- Implement Bayesian model - Iterated conditional modes algorithm (ICM).
- Improve efficieny of the code, bug removal, exception handling, additional testing.

**August 1 – August 15**

- For Documentation.
- Devise working examples that demonstrate various types of applications.

**August 15 – August 22**

- A buffer for unpredictable delay.

### Milestone: MF Future 1.1

- Extend Bayesian methods (variational BD, linearly constrained BD).
- Adapt PMF model to the interval-valued matrices and implement Interval-valued PMF (I-PMF) and Interval-valued NMF (I-NMF) (Shen, 2010).
- Nonnegative matrix approximation. Method for dimensionality reduction with respect on the nonnegativity of input data. Multiplicative iterative scheme (Sra, 2006).

## Source

Source code is published at GitHub repository.

ssh | http: git@github.com:marinkaz/MF.git | https://marinkaz@github.com/marinkaz/MF.git

## References

- Brunet, J. P., Tamayo, P., Golub, T. R., Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA, 2004, 101(12), 4164–4169.
- Lin, C.J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10), 2756-79. doi: 10.1162/neco.2007.19.10.2756.
- Kim, H., Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics (Oxford, England), 2007, 23(12), 1495–502.
- Pascual-Montano, A., Carazo, J. M., Kochi, K., Lehmann, D., Pascual-Marqui, R. D. Nonsmooth nonnegative matrix factorization (nsnmf). IEEE transactions on pattern analysis and machine intelligence, 2006 28(3), 403–415.
- Wang, Y., Turk, M. Fisher non-negative matrix factorization for learning local features, 2004.
- Salakhutdinov, R., Mnih, A. Probabilistic Matrix Factorization. Advances in Neural Information Processing Systems 20 (NIPS 2007).
- Sra, S., Dhillon, I. S. Nonnegative Matrix Approximation : Algorithms and Applications. Sciences-New York, 2006, 1–36.
- Zhiyong Shen, Liang Du, Xukun Shen, Yidong Shen, Interval-valued Matrix Factorization with Applications, Data Mining, IEEE International Conference on, pp. 1037–1042, 2010 IEEE International Conference on Data Mining, 2010.
- Dueck, D., Morris, Q. D., Frey, B. J. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics (Oxford, England), 21 Suppl 1, 2005, 144–51.
- Schmidt, M. N., Winther, O., Kai Hansen, L. Bayesian non-negative matrix factorization. Bayesian Statistics 7 (Oxford), 2003.
- Ochs, M. F., Kossenkov A. V. NIH Public Access. Methods, Methods Enzymol., 2009, 59–77.
- Z. Zhang, T. Li, C. H. Q. Ding, X. Zhang: Binary Matrix Factorization with Applications. ICDM 2007.