## Proposal for GSoC 2011

4 posts
• Page

**1**of**1**### Proposal for GSoC 2011

Hello, my name is Chen Xiajian. I come from Chinese Academy of Sciences. I am interested in the project Matrix factorization techniques for data mining.

I am looking forward to hear from you and you suggestion is helpfull to me

Thank you very much!

The following is the my proposal in Matrix factorization techniques for data mining

Name: Chen Xiajian

Email: xmujay@gmail.com

Project Title:

Matrix factorization techniques for data mining

Short description:

A large number of data mining tasks can be formulated as optimization problems. Examples include K-means clustering, support vector machines (SVM), linear discriminated analysis (LDA), and semi-supervised clustering. Nonnegative matrix factorization (NMF) has been shown to be able to solve several data mining problems including classification and clustering. I think that NMF provides an optimization framework which has a much broader applicability and can solve a variety of data mining problems.

NMF has been shown to be useful in a variety of applied settings, including environ metrics, chemometrics, pattern recognition, multimedia data analysis, text mining, and DNA gene expression analysis. Algorithmic extensions of NMF have been developed to accommodate a variety of objective functions and a variety of data analysis problems, including classification and collaborative filtering. It has been shown that NMF with the sum of squared error cost function is equivalent to a relaxed Kmeans clustering, the most widely used unsupervised learning algorithm. In addition, NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing; another unsupervised learning method popularly used in text analysis. In this proposal, we show that NMF provides a nice framework for solving many data mining optimization problems. In particular, we broaden the scope of application to include several different data mining problems. We provide NMFinspired solutions to

Multi-way normalized cut spectral clustering

Graph matching

Maximal clique and biclique

Benefits to the Orange (and/or other) project(s):

Matrix factorization techniques for data mining

Deliverables:

In particular, we broaden the scope of application to include several different data mining problems. We provide NMFinspired solutions to

Multi-way normalized cut spectral clustering

Graph matching

Maximal clique and biclique

Project Details:

Table of Contents:

Spectral clustering

Graph matching

Project Schedule

My resume

1. Spectral clustering

Normalized Cuts is a NP-hard optimization problem. We focus on the multi-way version of Normalized Cuts, which can be formulated as the minimization of the following objective function:

If we ignore the nonnegative constraints, and keep the orthogonality intact, the solution for H is given by the generalized eigenvectors of D-W. However, the mixed signs of the eigenvector solutions make the cluster assignment difficult. Thus the nonnegativity constraints is the key.

2. Graph matching

Graph matching plays an important role in many data mining applications such as pattern recognition, information retrieval, and frequent pattern mining to determine correspondences between the components (vertices and edges) of two attributed structures. Given two graphs with adjacency matrices A and B, the graph matching problem is to permute the nodes of A such that:

is minimized, where P is a permutation matrix.

Project Schedule :

Task Time Needed Description

Learning time Two weeks or less Learn more about algorithm

Read source code of preview module in the instruction of Orange in depth.

Design the data structure

Implementation Time One month Implement the algorithm with the guidance of mentor.

Debug Time Two weeks Test to public(if possible) or in private

Doc Time Two weeks or less Docs about what GsoC needed.

If time permit, doc about Algorithm should be written.

Refactor Time Two weeks or during the whole time of the GSoc Refactor preview module with mentor.

I am looking forward to hear from you and you suggestion is helpfull to me

Thank you very much!

The following is the my proposal in Matrix factorization techniques for data mining

Name: Chen Xiajian

Email: xmujay@gmail.com

Project Title:

Matrix factorization techniques for data mining

Short description:

A large number of data mining tasks can be formulated as optimization problems. Examples include K-means clustering, support vector machines (SVM), linear discriminated analysis (LDA), and semi-supervised clustering. Nonnegative matrix factorization (NMF) has been shown to be able to solve several data mining problems including classification and clustering. I think that NMF provides an optimization framework which has a much broader applicability and can solve a variety of data mining problems.

NMF has been shown to be useful in a variety of applied settings, including environ metrics, chemometrics, pattern recognition, multimedia data analysis, text mining, and DNA gene expression analysis. Algorithmic extensions of NMF have been developed to accommodate a variety of objective functions and a variety of data analysis problems, including classification and collaborative filtering. It has been shown that NMF with the sum of squared error cost function is equivalent to a relaxed Kmeans clustering, the most widely used unsupervised learning algorithm. In addition, NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing; another unsupervised learning method popularly used in text analysis. In this proposal, we show that NMF provides a nice framework for solving many data mining optimization problems. In particular, we broaden the scope of application to include several different data mining problems. We provide NMFinspired solutions to

Multi-way normalized cut spectral clustering

Graph matching

Maximal clique and biclique

Benefits to the Orange (and/or other) project(s):

Matrix factorization techniques for data mining

Deliverables:

In particular, we broaden the scope of application to include several different data mining problems. We provide NMFinspired solutions to

Multi-way normalized cut spectral clustering

Graph matching

Maximal clique and biclique

Project Details:

Table of Contents:

Spectral clustering

Graph matching

Project Schedule

My resume

1. Spectral clustering

Normalized Cuts is a NP-hard optimization problem. We focus on the multi-way version of Normalized Cuts, which can be formulated as the minimization of the following objective function:

If we ignore the nonnegative constraints, and keep the orthogonality intact, the solution for H is given by the generalized eigenvectors of D-W. However, the mixed signs of the eigenvector solutions make the cluster assignment difficult. Thus the nonnegativity constraints is the key.

2. Graph matching

Graph matching plays an important role in many data mining applications such as pattern recognition, information retrieval, and frequent pattern mining to determine correspondences between the components (vertices and edges) of two attributed structures. Given two graphs with adjacency matrices A and B, the graph matching problem is to permute the nodes of A such that:

is minimized, where P is a permutation matrix.

Project Schedule :

Task Time Needed Description

Learning time Two weeks or less Learn more about algorithm

Read source code of preview module in the instruction of Orange in depth.

Design the data structure

Implementation Time One month Implement the algorithm with the guidance of mentor.

Debug Time Two weeks Test to public(if possible) or in private

Doc Time Two weeks or less Docs about what GsoC needed.

If time permit, doc about Algorithm should be written.

Refactor Time Two weeks or during the whole time of the GSoc Refactor preview module with mentor.

### Re: Proposal for GSoC 2011

You should submit your proposal through GSoC page, not to this forum.

Please also do not post your whole proposals as we do not have time to read all proposals from everybody in advance. It also removes the whole idea of proposals, because if we will help everybody working on their proposals, on which exactly basis we will rank the students? This is also something which should show that you are capable of independent work. You should do the research of what are possibilities, propose your solutions with pros and cons, with arguments and references.

But of course you are free to ask specific questions about your ideas what you would do. We are glad to help to make your and ours understanding of what exactly the proposed ideas mean. But not the general questions like "is this a good proposal". So, are in your proposal any things you are unsure of?

Please also do not post your whole proposals as we do not have time to read all proposals from everybody in advance. It also removes the whole idea of proposals, because if we will help everybody working on their proposals, on which exactly basis we will rank the students? This is also something which should show that you are capable of independent work. You should do the research of what are possibilities, propose your solutions with pros and cons, with arguments and references.

But of course you are free to ask specific questions about your ideas what you would do. We are glad to help to make your and ours understanding of what exactly the proposed ideas mean. But not the general questions like "is this a good proposal". So, are in your proposal any things you are unsure of?

### Re: Proposal for GSoC 2011

Dear Chen,

This looks ok, but like Mitar said, submit your proposal officially, not through this forum. As you are familiar with the literature from this subject, you can perhaps refer to selected papers that describe the actual algorithms you would like to implement.

One more thought: the idea of this project is to implement in Orange the principal MF techniques, that are otherwise already implemented in some other packages. Do not forget that the project has also an "educational" component: besides implementation of most useful MF techiques we would like to provide documentation that (most importantly) includes cases (code + data) that show their utility. You will see that entire current documentation of Orange is structured in this way. So rather than implementing too many methods, implement a few, but for those do thorough testing, provide nice documentation and show examples of use.

This looks ok, but like Mitar said, submit your proposal officially, not through this forum. As you are familiar with the literature from this subject, you can perhaps refer to selected papers that describe the actual algorithms you would like to implement.

One more thought: the idea of this project is to implement in Orange the principal MF techniques, that are otherwise already implemented in some other packages. Do not forget that the project has also an "educational" component: besides implementation of most useful MF techiques we would like to provide documentation that (most importantly) includes cases (code + data) that show their utility. You will see that entire current documentation of Orange is structured in this way. So rather than implementing too many methods, implement a few, but for those do thorough testing, provide nice documentation and show examples of use.

4 posts
• Page

**1**of**1**