Orange Forum • View topic - Proposal for GSoC 2011

Proposal for GSoC 2011

General discussions about Orange and with Orange connected things (data mining, machine learning, bioinformatics...).

Proposal for GSoC 2011

Postby xmujay » Sun Apr 03, 2011 4:49

Hello, my name is Chen Xiajian. I come from Chinese Academy of Sciences. I am interested in the project Matrix factorization techniques for data mining.

I am looking forward to hear from you and you suggestion is helpfull to me
Thank you very much!

The following is the my proposal in Matrix factorization techniques for data mining
Name: Chen Xiajian
Email: xmujay@gmail.com
Project Title:
Matrix factorization techniques for data mining
Short description:
A large number of data mining tasks can be formulated as optimization problems. Examples include K-means clustering, support vector machines (SVM), linear discriminated analysis (LDA), and semi-supervised clustering. Nonnegative matrix factorization (NMF) has been shown to be able to solve several data mining problems including classification and clustering. I think that NMF provides an optimization framework which has a much broader applicability and can solve a variety of data mining problems.
NMF has been shown to be useful in a variety of applied settings, including environ metrics, chemometrics, pattern recognition, multimedia data analysis, text mining, and DNA gene expression analysis. Algorithmic extensions of NMF have been developed to accommodate a variety of objective functions and a variety of data analysis problems, including classification and collaborative filtering. It has been shown that NMF with the sum of squared error cost function is equivalent to a relaxed Kmeans clustering, the most widely used unsupervised learning algorithm. In addition, NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing; another unsupervised learning method popularly used in text analysis. In this proposal, we show that NMF provides a nice framework for solving many data mining optimization problems. In particular, we broaden the scope of application to include several different data mining problems. We provide NMFinspired solutions to
 Multi-way normalized cut spectral clustering
 Graph matching
 Maximal clique and biclique
Benefits to the Orange (and/or other) project(s):
Matrix factorization techniques for data mining
Deliverables:
In particular, we broaden the scope of application to include several different data mining problems. We provide NMFinspired solutions to
 Multi-way normalized cut spectral clustering
 Graph matching
 Maximal clique and biclique
Project Details:
Table of Contents:
  Spectral clustering
  Graph matching
  Project Schedule
 My resume

1. Spectral clustering
Normalized Cuts is a NP-hard optimization problem. We focus on the multi-way version of Normalized Cuts, which can be formulated as the minimization of the following objective function:

If we ignore the nonnegative constraints, and keep the orthogonality intact, the solution for H is given by the generalized eigenvectors of D-W. However, the mixed signs of the eigenvector solutions make the cluster assignment difficult. Thus the nonnegativity constraints is the key.

2. Graph matching
Graph matching plays an important role in many data mining applications such as pattern recognition, information retrieval, and frequent pattern mining to determine correspondences between the components (vertices and edges) of two attributed structures. Given two graphs with adjacency matrices A and B, the graph matching problem is to permute the nodes of A such that:
is minimized, where P is a permutation matrix.

Project Schedule :
Task Time Needed Description
Learning time Two weeks or less Learn more about algorithm
Read source code of preview module in the instruction of Orange in depth.
Design the data structure
Implementation Time One month Implement the algorithm with the guidance of mentor.
Debug Time Two weeks Test to public(if possible) or in private
Doc Time Two weeks or less Docs about what GsoC needed.
If time permit, doc about Algorithm should be written.
Refactor Time Two weeks or during the whole time of the GSoc Refactor preview module with mentor.

Re: Proposal for GSoC 2011

Postby Mitar » Sun Apr 03, 2011 9:40

You should submit your proposal through GSoC page, not to this forum.

Please also do not post your whole proposals as we do not have time to read all proposals from everybody in advance. It also removes the whole idea of proposals, because if we will help everybody working on their proposals, on which exactly basis we will rank the students? This is also something which should show that you are capable of independent work. You should do the research of what are possibilities, propose your solutions with pros and cons, with arguments and references.

But of course you are free to ask specific questions about your ideas what you would do. We are glad to help to make your and ours understanding of what exactly the proposed ideas mean. But not the general questions like "is this a good proposal". So, are in your proposal any things you are unsure of?

Re: Proposal for GSoC 2011

Postby Blaz » Sun Apr 03, 2011 12:38

Dear Chen,

This looks ok, but like Mitar said, submit your proposal officially, not through this forum. As you are familiar with the literature from this subject, you can perhaps refer to selected papers that describe the actual algorithms you would like to implement.

One more thought: the idea of this project is to implement in Orange the principal MF techniques, that are otherwise already implemented in some other packages. Do not forget that the project has also an "educational" component: besides implementation of most useful MF techiques we would like to provide documentation that (most importantly) includes cases (code + data) that show their utility. You will see that entire current documentation of Orange is structured in this way. So rather than implementing too many methods, implement a few, but for those do thorough testing, provide nice documentation and show examples of use.

Re: Proposal for GSoC 2011

Postby xmujay » Mon Apr 04, 2011 13:16

OK thank you all


Return to General Discussions