Version 60 (modified by crt, 2 years ago)

Google Summer of Code Ideas

Here is a list of project ideas that we think would be interesting and useful to work on during the Google Summer of Code program for Orange. You are welcome to propose your own ideas as well, provided they are connected with Orange, data mining, machine learning, artificial intelligence in general, bioinformatics, or other fields we are interested in (or that you can get us interested in).

You can find more information about our participation in Google Summer of Code here.

Ideas are listed in no particular order.

Open ideas

Text mining add-on for Orange

The current Orange add-on for text mining is outdated and incomplete. Its source code needs refactoring to comply with the Orange 2.5 development guidelines. Additionally, the add-on lacks documentation in reST format (including a tutorial for beginners), unit tests, and installation support through PyPI (http://pypi.python.org/pypi).

Review the existing text preprocessing methods (lemmatization and stemming) in orngText and propose improvements.

Useful skills: Python. Data mining.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Črt

Support for parallel computation for scripting/backend

The project will develop support for (semi)automatic parallelization/separation into processes, and possibly also for distributing processes over different computers. For example, cross-validation with multiple folds is easy to parallelize, as each fold can be computed independently and the results then combined into the final result. Parallelization should be seamless from the point of view of the user (the script writer).

It would be good to analyze such opportunities for parallelization, find what they have in common, and maybe devise a small helper library (possibly a wrapper for some existing grid computing system) that makes code run in parallel if such an environment is available, and run normally if not. And then, of course, move as many of the existing implementations as possible to this new parallelization support.
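To make the cross-validation example concrete, here is a minimal sketch of what such a helper could look like. It is an illustration only: `make_folds`, `evaluate_fold`, `run_jobs` and `cross_validate` are hypothetical names, not part of Orange, and the per-fold job is a toy majority-class predictor standing in for a real learner.

```python
from collections import Counter


def make_folds(n_items, n_folds):
    """Return a list of (train, test) index lists, one pair per fold."""
    indices = list(range(n_items))
    folds = []
    for k in range(n_folds):
        test = indices[k::n_folds]  # every n_folds-th index, starting at k
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        folds.append((train, test))
    return folds


def evaluate_fold(job):
    """Toy per-fold job: predict the majority training label, return test accuracy."""
    labels, train, test = job
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test) / len(test)


def run_jobs(func, jobs, executor=None):
    """Map func over jobs, through the executor if one is given, serially otherwise."""
    if executor is None:
        return [func(job) for job in jobs]
    return list(executor.map(func, jobs))


def cross_validate(labels, n_folds=5, executor=None):
    """Average the per-fold accuracies; the folds are independent of each other."""
    jobs = [(labels, train, test) for train, test in make_folds(len(labels), n_folds)]
    scores = run_jobs(evaluate_fold, jobs, executor)
    return sum(scores) / len(scores)
```

With `executor=None` the script runs normally in a single process. Passing an executor that offers a `map` method, for instance a `concurrent.futures.ProcessPoolExecutor` instance, would compute the folds in separate processes without any other change to the script.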

Useful skills: Python. Grid computing experience.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Anže

Test scripts, example scripts and documentation

Orange comes with substantial scripting documentation which, in places, could be considerably improved. Also, Orange 2.5 with its new class hierarchy and functions is coming, so both the code snippets and the corresponding documentation will require revision. The project would involve designing new use cases (snippets of code that demonstrate various aspects of Orange), reviewing the present set of snippets, and integrating code snippets into the documentation.

Snippets in documentation also serve as regression scripts upon which Orange is tested daily. Another purpose of this project is to increase the number and coverage of regression scripts.

This could also be a good project if you would like to learn more about Orange, data mining and machine learning.

Useful skills: Proficiency in English (probably native speaker). Language/writing skills. Python.

Level from 1 (beginner) to 5 (professional): 3

Possible mentors: Blaž

A social platform for Orange

Orange's visual programming environment can combine any of its more than 100 widgets into schemas that perform clustering, classification, visualization-based analysis, PCA, and many other tasks. A repository of typical schemas would be most welcome: novice users could choose from the already defined schemas and analyze their own data with them, intermediate users could use the library to improve or augment their own schemas, and experienced users would be able to store their inventions in the repository for others to use. Orange could also train a widget recommendation system from the schemas in the repository. The repository could feature tagging, liking, commenting, and everything else that a social platform can provide. Schemas in the repository could be described in text or video.

We would also like to give users a way to upload their datasets and scripts/code snippets, the latter in a way similar to Gist, but probably based on Mercurial (or not).

Users could follow each other and their contributions (similar to Project Noah), use OpenID to log in, have their own profiles, and so on.

In summary, the project would develop a new social platform for data mining solutions, possibly relying on an existing solution (e.g. myExperiment) or crafting something new (and simpler?). The repository would feature web access, but its schemas should also be available from within Orange (browsing, uploading and downloading). Seamless integration of the repository and Orange is crucial to the success of this project. The Orange part of this integration will be done by someone from the laboratory; the student will only have to define a simple HTTP-based API/protocol for it and implement it on the server side.

Useful skills: Knowledge of how to develop a social platform (possibly Django). Python programming. Some knowledge of data mining/machine learning would help as well.

Level from 1 (beginner) to 5 (professional): 4.5

Possible mentors: Blaž, Mitar

Repository for add-ons

Orange supports add-ons, which can add new features to scripting and new widgets (GUI). Currently, this feature is highly underused: only a few internally developed add-ons exist. It would be great to open it up so that contributors around the world could submit their add-ons to a central repository, from which the add-ons could then be installed into Orange.

It would be good to try to integrate this with existing technologies and portals (Bitbucket, GitHub, Python Package Index).

Useful skills: Python (along with distutils, python package directory structure rules and packaging skills).

Level from 1 (beginner) to 5 (professional): 3

Possible mentors: Matija, Mitar

Widgets in separate processes

Widgets in Orange Canvas currently run in a single process. As they are independent given their inputs, they could frequently work in parallel (in a data-flow manner). The objective of this task would be to modify Orange Canvas so that each widget runs in its own process.

It would also be useful to separate the GUI thread from the main payload computation of widgets. Currently we use just one thread (the GUI thread) for everything, so while a widget is working it has to call back into the GUI repeatedly to keep it responsive. Separating the two would make the code cleaner.
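The separation of computation from the GUI thread can be sketched with the standard library alone. This is only an illustration under our own assumptions: `heavy_computation`, `run_in_background` and `wait_for_result` are hypothetical names, and `wait_for_result` merely stands in for the GUI event loop, which in Orange Canvas would be driven by Qt.

```python
import queue
import threading


def heavy_computation(data, report):
    """Hypothetical widget payload: sums squares and reports progress in percent."""
    total = 0
    for i, x in enumerate(data, start=1):
        total += x * x
        report(100 * i // len(data))
    return total


def run_in_background(func, data):
    """Start func in a worker thread and return the queue it posts messages to."""
    messages = queue.Queue()

    def worker():
        try:
            result = func(data, lambda pct: messages.put(("progress", pct)))
        except Exception as exc:  # hand errors over instead of leaving the GUI waiting
            messages.put(("error", exc))
        else:
            messages.put(("done", result))

    threading.Thread(target=worker, daemon=True).start()
    return messages


def wait_for_result(messages, on_progress=lambda pct: None):
    """Stand-in for the GUI loop: handle messages until the result arrives."""
    while True:
        kind, value = messages.get()
        if kind == "done":
            return value
        if kind == "error":
            raise value
        on_progress(value)
```

The payload never touches the GUI; it only posts messages, so the widget code no longer needs to call back into the GUI while it works. Running each widget in its own process would follow the same message-passing idea with a process-safe queue.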

Useful skills: Python programming with multiple processes and threads. Qt and PyQt experience. Program design.

Level from 1 (beginner) to 5 (professional): 5

Possible mentors: Marko

Bridge between Orange and R

R contains many great methods and tools that would also be very useful in Orange. To prevent duplication of work, it would be great to be able to use them directly in Orange, without reimplementing them.

The idea is to research possibilities for this and then implement a future-proof bridge between Orange and R.

Useful skills: Python. C/C++. Experience with R. Experience with program-to-program interfaces.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Jure

Time-series analysis

Orange currently lacks any time-series analysis tools. It would be great to develop some basic tools for dealing with time series: reading, normalizing, basic pattern search, feature extraction, (auto-)correlation and similar basic techniques, and so on. Research what other similar applications support and propose which features would be useful to have as a basic set of tools.

It is important to implement feature extraction from time series in a modular way, so that it can be integrated with the rest of Orange (the learning, classification and visualization tools that already exist there).

Disclaimer: We do not have any experts on time-series analysis in the laboratory, so the student will have to be independent and willing to learn the subject on their own. The mentor will provide some guidance and help with integration into Orange.
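As an illustration of modular feature extraction, the sketch below turns one time series into a flat dictionary of features that a learner could then consume as ordinary attributes. The function names and the chosen feature set are our assumptions, not an existing Orange API.

```python
from statistics import mean, pstdev


def normalize(series):
    """Scale a series to zero mean and unit variance (all zeros if it is constant)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mu = mean(series)
    variance = sum((x - mu) ** 2 for x in series)
    if variance == 0 or lag >= n:
        return 0.0
    covariance = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return covariance / variance


def extract_features(series, lags=(1, 2)):
    """Describe one time series with a fixed set of named features."""
    features = {
        "mean": mean(series),
        "std": pstdev(series),
        "min": min(series),
        "max": max(series),
    }
    for lag in lags:
        features["autocorr_%d" % lag] = autocorrelation(series, lag)
    return features
```

Because every series maps to the same fixed set of named values, a collection of time series becomes a regular data table, which is what makes the integration with existing learning and visualization tools straightforward.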

Useful skills: Python. Data analysis experience. Digital signal processing experience could also help.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Mitar

Porting Python code to Orange 3.0

While migrating to Python 3.0, which broke compatibility with older versions of Python, we decided to seize the opportunity to clean up our house, too. A majority of Orange's C++ code has been rewritten, and while most functionality is still there, classes have been renamed, some have been eliminated, and function arguments have been cleaned up. Now we need to change the Python parts correspondingly. This would require some routine refactoring, as well as reimplementing some functionality that used to be in C++ but should be moved to Python. At the same time, we would need tools to make this process as automated as possible (an Orange equivalent of the 2to3 script for Python).

Note that this is not about porting Orange from Python 2.X to Py3K: this is trivial and can be done in one evening (we tried it). The work ranges from running 2to3 to redesigning some architectural parts, so the student will have to be in constant contact with the core group.
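The simplest part of such a tool, renaming classes and functions, could be sketched as below. The rename table is entirely hypothetical (`old_module.OldName` and friends are placeholders, not real Orange names), and the approach is plain text substitution rather than the syntax-tree rewriting that 2to3 performs.

```python
import re

# Hypothetical old-name -> new-name table; the real one would come from the core group.
RENAMES = {
    "old_module.OldName": "new_module.NewName",
    "old_module.old_function": "new_module.new_function",
}


def build_pattern(renames):
    """Compile one regex that matches any old name as a whole dotted identifier."""
    names = sorted(renames, key=len, reverse=True)  # prefer the longest match
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"(?<![\w.])(" + alternatives + r")(?!\w)")


def port_source(source, renames=RENAMES):
    """Return the source text with every old name replaced by its new name."""
    pattern = build_pattern(renames)
    return pattern.sub(lambda match: renames[match.group(1)], source)
```

For example, `port_source("m = old_module.OldName(x)")` returns `"m = new_module.NewName(x)"`. Text substitution also rewrites matches inside strings and comments and cannot handle changed function arguments, so a real tool would more likely work on the syntax tree.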

Required skills: good knowledge of Orange and Python

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Janez

Widgets for statistics

Orange is rather weak in basic statistics, from various statistical tests to linear regression, dimensionality reduction and so forth. It would be great to have some widgets for this. The code for computation of all this is already available in other libraries which we can call from Python, so what we actually need is a good integration within the canvas.

Level from 1 (beginner) to 5 (professional): 3.5

Possible mentors: Janez

biox library (NGS, next-generation sequencing)

Orange already offers the Bioinformatics add-on but currently lacks tools for NGS (next-generation sequencing) data management and analysis. We suggest developing a Python library, biox (also by integrating existing state-of-the-art software), to be used in Orange.

Short description of project tasks:

  • develop support for reading/writing/searching the most used bioinformatics file formats: fasta, fastq, bed, wig, bigWig, gtf, gff3, bedGraph. Carefully craft memory efficient representations of various features (if needed, represent features in C and connect with Python),
  • develop simple (programmatically easy to use) wrappers for existing NGS open source software solutions such as: read quality analysis (e.g. FASTQC), mapping of reads to reference genomes (e.g.: bowtie, bowtie2, tophat), differential expression analysis (e.g.: DESeq, baySeq),
  • where needed, various tools should be able to produce statistical reports in text and also graphical format (matplotlib).
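To give a feel for the first task, here is a minimal sketch of a memory-friendly FASTA reader that yields one record at a time instead of loading the whole file. The names `read_fasta` and `gc_content` are hypothetical and not part of any existing biox code.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    if not sequence:
        return 0.0
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Toy input made up for this example.
example = io.StringIO(">seq1 demo\nACGT\nGGCC\n>seq2\nATAT\n")
records = list(read_fasta(example))
```

Since `read_fasta` is a generator, a caller that only needs summary statistics (such as `gc_content` per record) never holds more than one sequence in memory.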

Level from 1 (beginner) to 5 (professional): 5

Possible mentors: Gregor, Tomaz, Crt

Ideas selected for GSoC 2011

Replacing PyQwt with pure PyQt visualizations

Many visualizations in Orange widgets currently use PyQwt. It seems a good idea to migrate to pure Qt implementation, for several reasons:

  • PyQwt development seems stalled. The current version on the site is for Python 2.6 and Qt 4.5, and although Python 3.X is said to be supported, we have not been able to build it. While Orange itself is basically ported to Python 3.X, PyQwt is a show stopper. We have also had this same problem with previous Python version upgrades.
  • Qwt is not very aesthetically pleasing: it is a very good tool for plotting the data, but not for publishing pictures in glossy journals and web sites. ;) The new Qt graphics classes would do a much better job.
  • We are not using much of Qwt, we need only some basic stuff, which should be easy to reimplement.

Given all this, it would make little sense for the Orange team to take over the maintenance of the Qwt-to-Python interface.

Fortunately, most widgets do not interact with PyQwt directly but instead use a middle layer, OWGraph, which is a part of Orange. The "toughest" part will be to reimplement Qwt's classes for drawing curves, which need to be in C++ (with a sip interface to Python).

Useful skills: Python, C++, sip, Qt.

Level from 1 (beginner) to 5 (professional): 4.5

Possible mentors: Miha, Janez

Matrix factorization techniques for data mining

Matrix factorization is a fundamental building block of many current data mining approaches. Factorization techniques such as non-negative and probabilistic sparse matrix factorization are widely used today in various data mining applications. The aim of this project is to develop a scripting library for Orange that includes various matrix factorization techniques and provides code documentation and working examples that demonstrate various types of applications. Selected examples are to be included in the documentation. The development is therefore oriented entirely towards the scripting library; the project would not involve any widget programming. We would, though, like the student to sketch what the widgets that use this library should look like, which methods from the scripting library they should access, and which visualizations (if any) would be useful to implement.
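As one example of the techniques mentioned, non-negative matrix factorization approximates a non-negative matrix V by a product W·H of two smaller non-negative matrices. The sketch below uses the well-known multiplicative update rules of Lee and Seung on plain Python lists to stay dependency-free; a real library would more likely build on numpy. All function names are ours, chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h
```

The updates only multiply and divide non-negative numbers, so W and H stay non-negative without any explicit constraint handling, which is the main appeal of this variant.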

Useful skills: Python. Matrix operations (possibly in numpy). Good background in math, linear algebra and optimization.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Blaž

Animations in Orange

Data visualization plays a very important role in understanding relationships in the data. Unfortunately, it is usually limited to two dimensions (e.g. a scatter plot), although additional information about the data can be conveyed by different colors, sizes and shapes of the points. There can be, however, additional variables in the data (e.g. time) that strongly influence the scatter plot. As in Gapminder, time can be used as an "animation" variable: one could play the animation and see how the scatter plot changes over time (or over any other continuous variable in the data).
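The data side of such an animation is simple to sketch: the rows are grouped by the animation variable, and each group becomes one frame of the scatter plot. The function name and the dictionary-based row format below are assumptions made for this example; drawing the frames in a widget is left out.

```python
from collections import defaultdict


def animation_frames(rows, x, y, time):
    """Group rows by the animation variable; return [(time_value, [(x, y), ...]), ...] in time order."""
    frames = defaultdict(list)
    for row in rows:
        frames[row[time]].append((row[x], row[y]))
    return sorted(frames.items())


# Made-up toy rows; "t" plays the role of the animation variable.
rows = [
    {"t": 2, "x": 1, "y": 5},
    {"t": 1, "x": 0, "y": 4},
    {"t": 2, "x": 3, "y": 6},
]
frames = animation_frames(rows, "x", "y", "t")
```

A widget would then redraw the scatter plot once per frame, possibly interpolating point positions between consecutive frames for a smoother animation.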

Useful skills: Python. Widgets programming.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: ?