Version 82 (modified by marko, 18 months ago)

Google Summer of Code Ideas

Here is a list of project ideas that might be interesting and useful to carry out for Orange during the Google Summer of Code program. Your own ideas for data analytics and visualization methods that would complement Orange are most welcome, too!

You can find more information about our participation in Google Summer of Code here.

Ideas are listed in no particular order.

Open ideas for 2013

Porting Python code to Orange 3.0

We are (still) migrating Orange to Python 3.0 and at the same time reimplementing Orange from scratch. We are ditching the C++ and relying on numpy, scipy.sparse, scikit-learn and similar libraries, and using Cython for anything that needs to be fast but is not provided elsewhere. We are looking for help in reimplementing various parts of Orange (note that this is not about porting Orange from Python 2.X to Py3K: this is trivial and can be done in one evening - we tried it). This project requires working closely (e.g. almost daily communication) with the core team.

Required skills: Good knowledge of Orange, Python and its libraries, and Cython.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Janez, Marko

Parallel model evaluation

Most techniques for model evaluation are embarrassingly parallel: bootstrap sampling of examples, cross-validation, and leave-one-out validation. Implement their parallel versions for Orange 3. Parallelization should be seamless from the point of view of the user (the script writer). This project requires working closely (e.g. almost daily communication) with the core team.
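To illustrate why cross-validation parallelizes so easily, here is a minimal sketch in plain Python. It is not Orange code: the helper names (k_fold_indices, evaluate_fold, cross_validate) and the majority-class stand-in model are assumptions made for the example. Each fold is an independent task, so the only thing that changes between a serial and a parallel run is the map function passed in.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k interleaved folds over range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def evaluate_fold(task):
    """Fit a majority-class stand-in model on one fold; return test accuracy.

    Defined at module level so that it can be pickled and sent to a
    worker process.
    """
    labels, train, test = task
    majority = Counter(labels[j] for j in train).most_common(1)[0][0]
    return sum(labels[j] == majority for j in test) / len(test)


def cross_validate(labels, k, map_fn=map):
    """Return per-fold accuracies; map_fn decides where the folds run."""
    tasks = [(labels, train, test)
             for train, test in k_fold_indices(len(labels), k)]
    return list(map_fn(evaluate_fold, tasks))
```

Called as cross_validate(labels, 5), the folds run one after another. Passing the map method of a concurrent.futures.ProcessPoolExecutor as map_fn would run them in worker processes without changing the calling script, provided the data can be pickled.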

Required skills: Python, experience with multiprocessing.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Marko

Neural Networks

Orange implements many algorithms for classification, but currently lacks support for neural network learning. The task consists of implementing neural networks (multilayer perceptron, convolutional and deep belief networks) in Python with numpy. Neural networks will be implemented as an add-on to Orange.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Jure

dictyExpress

dictyExpress is an interactive, web-based exploratory data analytics application for the analysis of over 1,000 Dictyostelium gene expression experiments. The app is basically a web interface to Orange algorithms that run in the cloud. We are porting the obsolete Flash user interface to HTML5/JavaScript. This is an extremely multidisciplinary project that requires a wide range of skills.

Useful skills: JavaScript, HTML5, CSS3, AngularJS, Bootstrap, Python, Django, basic bioinformatics knowledge.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Miha

Widgets in separate processes

Widgets in Orange Canvas currently run in a single process. As they are independent given their inputs, they could frequently work in parallel (in a data-flow manner). The objective of this task would be to modify Orange Canvas so that each widget could run in its own process.

It would also be useful to separate the GUI thread from the main payload computation of widgets. Currently we use just one thread (the GUI thread) for everything, so while a widget is working it has to call back into the GUI repeatedly to keep it responsive.

We would start by making a single widget able to run in its own process and then integrating it into the canvas. Afterwards, we would try to find a systematic way to branch off widgets into subprocesses with the least changes in the current code base.
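The separation of payload computation from the GUI thread can be sketched without Qt. In the hedged example below, heavy_computation stands in for a widget's payload and process_events is a hypothetical callback representing one tick of the GUI event loop; neither is part of Orange or PyQt. The payload runs in a worker, and the main thread keeps servicing events until the result is ready.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def heavy_computation(n):
    """Stand-in for a widget's payload: slow, and independent of the GUI."""
    time.sleep(0.05)
    return sum(i * i for i in range(n))


def run_with_responsive_loop(func, arg, executor, process_events,
                             poll_interval=0.01):
    """Run func(arg) in a worker while the caller keeps handling events.

    process_events is a hypothetical hook for one GUI event-loop tick.
    """
    future = executor.submit(func, arg)
    while not future.done():
        process_events()           # the GUI stays responsive here
        time.sleep(poll_interval)
    return future.result()


def demo():
    """Count how many event-loop ticks happen while the payload runs."""
    ticks = []
    with ThreadPoolExecutor(max_workers=1) as executor:
        result = run_with_responsive_loop(
            heavy_computation, 10, executor, lambda: ticks.append(1))
    return result, len(ticks)
```

A real Qt application would more likely be notified through signals or a timer than through a polling loop; the loop is used here only to keep the sketch self-contained. Replacing the thread pool with a concurrent.futures.ProcessPoolExecutor would move the payload into its own process, as long as the function and its arguments can be pickled.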

Useful skills: Python programming with multiple processes and threads. Qt and PyQt experience.

Level from 1 (beginner) to 5 (professional): 5

Possible mentors: Marko

Previous ideas

Data input from mldata.org

mldata.org is an excellent machine learning data repository. It would be great if Orange had script-based access to the repository that also supported querying (e.g., show me a list of all regression data sets with more than 100 features and 1000 data instances). Implementing automatic querying and data download would also provide a basis for a widget for browsing, searching and filtering mldata.org data sets. mldata.org at present does not feature programmatic access and querying, so carrying out this task may involve changing their code first.
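The kind of query mentioned above could look like the following sketch. Since mldata.org offers no programmatic access, everything here is an assumption: DatasetInfo, its fields, the query function and the catalog entries are invented for illustration and do not describe any existing API or real data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical metadata record for one repository data set."""
    name: str
    task: str            # e.g. "regression" or "classification"
    n_features: int
    n_instances: int


def query(catalog, task=None, features_over=0, instances_over=0):
    """Return records matching the task with strictly more features and
    instances than the given thresholds."""
    return [d for d in catalog
            if (task is None or d.task == task)
            and d.n_features > features_over
            and d.n_instances > instances_over]


# Invented entries, used only to exercise the query.
EXAMPLE_CATALOG = [
    DatasetInfo("toy-regression-wide", "regression", 250, 5000),
    DatasetInfo("toy-regression-small", "regression", 10, 200),
    DatasetInfo("toy-regression-edge", "regression", 100, 1000),
    DatasetInfo("toy-classification", "classification", 300, 4000),
]
```

With this catalog, query(EXAMPLE_CATALOG, task="regression", features_over=100, instances_over=1000) keeps only the first entry; the "edge" entry is dropped because the comparison is strict.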

Useful skills: Python.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Blaž

Support for parallel computation for scripting/backend

The project will develop support for (semi)automatic parallelization/separation into processes, and possibly also distribution of processes over different computers. For example, cross-validation with multiple folds is an easily parallelized technique, as each fold can be computed independently and the results then combined into the final result. Parallelization should be seamless from the point of view of the user (the script writer).

It would be good to analyze such opportunities for parallelization, find what they have in common and maybe devise a small helper library (possibly a wrapper for some existing grid computing system) that makes code run in parallel if such an environment is available, and run normally if not. Then, of course, as many of the existing implementations as possible should be moved to this new parallelization support.
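A helper of this kind might start out as small as the sketch below, which uses the standard concurrent.futures module instead of a grid computing system. The names default_executor and run_tasks are hypothetical. Independent tasks are computed, in parallel when an executor is available and serially otherwise, and their partial results are then combined.

```python
from concurrent.futures import ProcessPoolExecutor


def default_executor(max_workers=None):
    """Return a process pool, or None where one cannot be created.

    Assumption: platforms without working multiprocessing support
    raise one of these exceptions when the pool is constructed.
    """
    try:
        return ProcessPoolExecutor(max_workers=max_workers)
    except (ImportError, NotImplementedError, OSError):
        return None


def run_tasks(func, items, combine, executor=None):
    """Apply func to every item, then combine the partial results.

    Falls back to a plain serial loop when no executor is given.
    """
    if executor is None:
        partial = [func(item) for item in items]
    else:
        partial = list(executor.map(func, items))
    return combine(partial)


def square(x):
    """Example task; module-level so it can be sent to a worker process."""
    return x * x
```

Because the serial and parallel paths share one signature, existing code could be moved to the helper first and parallelized later simply by passing an executor.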

Useful skills: Python. Grid computing experience.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Anže

Test scripts, example scripts and documentation

Orange comes with substantial documentation for scripting which, in places, could be substantially improved. Also, Orange 2.5 with its new class hierarchy and functions is just about to be released, and some code snippets and the corresponding documentation require revision (note that the Reference Guide has already been rewritten). The project would involve designing new use cases (snippets of code that demonstrate various aspects of Orange), reviewing the present set of snippets, and integrating code snippets into the documentation. Writing an Orange Cookbook or an Orange User's Guide would be most welcome.

Snippets in the documentation also serve as regression scripts with which Orange is tested daily. Another purpose of this project might be to increase the number and coverage of unit tests.
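One common way to make documentation snippets double as regression tests in Python is the standard doctest module. The sketch below is a generic illustration, not a description of how Orange's daily tests are actually set up; mean and run_snippets are hypothetical names.

```python
import doctest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence.

    The example below is documentation and a regression test at once:

    >>> mean([1, 2, 3, 4])
    2.5
    """
    return sum(values) / len(values)


def run_snippets(obj):
    """Run the examples in obj's docstring; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Supply globals explicitly so the examples can refer to obj by name.
    for test in finder.find(obj, globs={obj.__name__: obj}):
        runner.run(test)
    return runner.failures, runner.tries
```

If the behavior of mean ever changed, the snippet in its docstring would fail, flagging both the regression and the outdated documentation.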

This could be also a good project if you would like to learn more about Orange, data mining and machine learning itself.

Useful skills: Proficiency in English (probably native speaker) if the target is documentation writing. Language/writing skills. Good knowledge of Python if the target is writing of unit tests.

Level from 1 (beginner) to 5 (professional): 3

Possible mentors: Blaž

Repository for add-ons

Orange supports add-ons, which can add new features to scripting and new widgets (GUI). Currently, this feature is highly underused and is used only for a few internally developed add-ons. It would be great to open it up so that contributors around the world could submit their add-ons to a central repository, from which add-ons could then be installed into Orange and used.

It would be good to try to integrate this with existing technologies and portals (Bitbucket, GitHub, Python Package Index).
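If add-ons were distributed as ordinary Python packages, Orange could find the installed ones through package entry points. The sketch below uses the standard importlib.metadata module; the group name "orange.addons" is an assumption made for this example, not an established convention taken from this page.

```python
from importlib import metadata

ADDON_GROUP = "orange.addons"  # hypothetical entry-point group name


def discover_addons(group=ADDON_GROUP):
    """Return {name: entry point} for installed packages in the group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: selectable collection of entry points.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9: plain mapping from group name to entry points.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}
```

An add-on package would declare an entry point in that group in its packaging metadata, and calling load() on a discovered entry point would import the object it refers to.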

Useful skills: Python (along with distutils, Python package directory structure rules and packaging skills).

Level from 1 (beginner) to 5 (professional): 3

Possible mentors: Matija, Mitar

Animations in Orange

Data visualization plays a very important role in understanding relationships in the data. Unfortunately, it is usually limited to two dimensions (e.g. a scatter plot); additional information about the data can be presented by different colors, sizes and shapes of the points. There can be, however, additional variables in the data (e.g. time) which can have a strong influence on the scatter plot. As in Gapminder, time can be used as an "animation" variable. One could play animations and see how the scatter plot changes over time (or over any other continuous variable from the data).
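The data preparation behind such an animation can be sketched as grouping points by the animation variable, one scatter plot frame per value. The function below is a hypothetical illustration in plain Python and does not use Orange's data structures; rows are assumed to be dictionaries keyed by variable name.

```python
from collections import defaultdict


def animation_frames(rows, x, y, time):
    """Group rows by the animation variable.

    Yields (t, [(x, y), ...]) pairs ordered by t, one per frame.
    """
    frames = defaultdict(list)
    for row in rows:
        frames[row[time]].append((row[x], row[y]))
    for t in sorted(frames):
        yield t, frames[t]
```

For a truly continuous animation variable, the values would first have to be binned (or the points interpolated between frames); this sketch assumes a modest number of distinct values, such as years.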

Useful skills: Python. Widget programming.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Blaž