Version 75 (modified by miha, 18 months ago)

Google Summer of Code Ideas

Here is a list of ideas for projects that might be interesting and useful to carry out for Orange during the Google Summer of Code program. Your own ideas for data analytics and visualization methods that would complement Orange are most welcome, too!

You can find more information about our participation in Google Summer of Code here.

Ideas are listed in no particular order.

Open ideas

Porting Python code to Orange 3.0

While migrating to Python 3.0, which broke compatibility with older versions of Python, we decided to seize the opportunity to clean house as well. A majority of Orange's C++ code has been rewritten, and while most functionality is still there, classes have been renamed, some have been eliminated, function arguments have been cleaned up and so forth. Now we need to change the Python parts correspondingly. This will require routine refactoring as well as reimplementing some functionality that used to be in C++ but should be moved to Python. At the same time, we need tools to automate this process as much as possible (an Orange equivalent of Python's 2to3 script).

Note that this is not about porting Orange from Python 2.x to Python 3 as such: that step is trivial and can be done in one evening (we tried it). The work ranges from running 2to3 to redesigning some parts of the architecture, so the student will have to be in constant contact with the core group.
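To illustrate the kind of tooling meant here, the following is a minimal sketch of a rename pass in the spirit of 2to3. The names in `RENAMES` are hypothetical placeholders, not actual Orange identifiers, and the sketch assumes that a plain regular expression is good enough. A real tool would more likely work on the syntax tree so that strings and comments are left alone.

```python
import re

# Hypothetical old -> new names; the real mapping would come from the
# list of classes and functions that were renamed in Orange 3.0.
RENAMES = {
    "OldExampleTable": "NewTable",
    "old_filter": "new_filter",
}


def rename_identifiers(source, renames=RENAMES):
    """Replace whole-word occurrences of old names with their new names."""
    if not renames:
        return source
    # Longest names first, so that a name that is a prefix of another
    # cannot shadow it in the alternation.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda match: renames[match.group(0)], source)


old_code = "data = OldExampleTable('iris')\nsubset = old_filter(data)\n"
print(rename_identifiers(old_code))
```

Renames alone cover only the routine part of the work. Changed function arguments and functionality moved from C++ to Python would still need manual attention.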

Required skills: Good knowledge of Orange and Python.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Janez

Data input from mldata.org

mldata.org is an excellent machine learning data repository. It would be great if Orange had script-based access to the repository that also supported querying (e.g., show me a list of all regression data sets with more than 100 features and 1000 data instances). Automatic querying and data download would also provide a basis for a widget for browsing, searching and filtering mldata.org data sets. At present mldata.org does not offer programmatic access or querying, so carrying out this task may involve changing their code first.
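The example query above could look roughly like this from a script. This is only a sketch of the desired behaviour: `DatasetInfo`, `query` and the catalogue entries are all hypothetical, since the repository currently exposes no such interface.

```python
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    """Hypothetical metadata record for one repository data set."""
    name: str
    task: str          # e.g. "regression" or "classification"
    n_features: int
    n_instances: int


def query(datasets, task=None, min_features=0, min_instances=0):
    """Return data sets of the given task with strictly more features
    and instances than the two thresholds."""
    return [
        d for d in datasets
        if (task is None or d.task == task)
        and d.n_features > min_features
        and d.n_instances > min_instances
    ]


# Made-up catalogue entries, for illustration only.
catalogue = [
    DatasetInfo("small-regression", "regression", 12, 500),
    DatasetInfo("wide-regression", "regression", 250, 4000),
    DatasetInfo("wide-classification", "classification", 300, 9000),
]

matches = query(catalogue, task="regression", min_features=100, min_instances=1000)
print([d.name for d in matches])
```

In a real implementation the catalogue would be fetched from the repository rather than built by hand, and the same filter could back a browsing widget.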

Useful skills: Python.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Blaž

Neural Networks

Orange implements many algorithms for classification, but currently lacks support for neural network learning. The task consists of implementing neural networks (multilayer perceptron, convolutional and deep belief networks) in Python with numpy. Neural networks will be implemented as an add-on to Orange.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Jure

Support for parallel computation for scripting/backend

The project will develop support for (semi)automatic parallelization and separation into processes, and possibly also distribution of processes over different computers. For example, cross-validation with multiple folds is easy to parallelize, as each fold can be computed independently and the results then combined into the final result. Parallelization should be seamless from the point of view of the user (the script writer).
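The cross-validation case can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not Orange code: the "model" is a stand-in that predicts the training mean, and `make_folds`, `evaluate_fold` and `cross_validate` are hypothetical names. A thread pool is used to keep the sketch portable; for CPU-bound learners `ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor


def make_folds(n_instances, n_folds):
    """Split instance indices into n_folds disjoint test sets."""
    return [list(range(i, n_instances, n_folds)) for i in range(n_folds)]


def evaluate_fold(args):
    """Train on everything outside the fold and score on the fold.

    The "model" is a stand-in: it predicts the mean of the training
    targets, and the score is the summed squared error on the test part.
    """
    targets, test_idx = args
    test = set(test_idx)
    train = [y for i, y in enumerate(targets) if i not in test]
    mean = sum(train) / len(train)
    return sum((targets[i] - mean) ** 2 for i in test_idx)


def cross_validate(targets, n_folds=5, executor_cls=ThreadPoolExecutor):
    """Evaluate folds independently, then combine into one mean squared error."""
    folds = make_folds(len(targets), n_folds)
    with executor_cls() as executor:
        errors = list(executor.map(evaluate_fold, [(targets, f) for f in folds]))
    return sum(errors) / len(targets)


targets = [float(i % 7) for i in range(100)]
print(cross_validate(targets))
```

Because no fold depends on another, the only shared step is the final combination of the per-fold errors.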

It would be good to analyze such opportunities for parallelization, find what they have in common, and perhaps devise a small helper library (possibly a wrapper for some existing grid computing system) that makes code run in parallel when such an environment is available and run normally when it is not. Then, of course, as much of the existing implementation as possible should be moved to this new parallelization support.
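One possible shape for such a helper is a single entry point that uses a parallel backend when one is supplied and falls back to an ordinary loop otherwise. The sketch below assumes the backend follows the `concurrent.futures` executor interface; `parallel_map` is a hypothetical name, and a grid computing system would need its own adapter.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, executor=None):
    """Apply func to each item, in parallel when an executor is given.

    Without an executor the call degrades to an ordinary serial loop,
    so calling code does not need two separate paths.
    """
    if executor is None:
        return [func(item) for item in items]
    return list(executor.map(func, items))


def square(x):
    return x * x


print(parallel_map(square, range(5)))            # serial fallback
with ThreadPoolExecutor(max_workers=2) as pool:
    print(parallel_map(square, range(5), pool))  # same call, run on a pool
```

Both calls return results in input order, so existing code can switch between the two modes without other changes.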

Useful skills: Python. Grid computing experience.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Anže

Test scripts, example scripts and documentation

Orange comes with substantial documentation for scripting which, in places, could be considerably improved. Also, Orange 2.5 with its new class hierarchy and functions is just about to be released, and some code snippets and the corresponding documentation need revision (note that the Reference Guide has already been rewritten). The project would design new use cases (snippets of code that demonstrate various aspects of Orange), review the present set of snippets, and integrate code snippets into the documentation. Writing an Orange Cookbook or an Orange User's Guide would be most welcome.

Snippets in the documentation also serve as regression scripts against which Orange is tested daily. Another purpose of this project might be to increase the number and coverage of unit tests.

This could also be a good project if you would like to learn more about Orange, data mining and machine learning.

Useful skills: Proficiency in English (probably native speaker) if the target is documentation writing. Language/writing skills. Good knowledge of Python if the target is writing of unit tests.

Level from 1 (beginner) to 5 (professional): 3

Possible mentors: Blaž

Repository for add-ons

Orange supports add-ons, which can add new features to scripting and new widgets (GUI). Currently, this feature is highly underused and serves only a few internally developed add-ons. It would be great to open it up so that contributors around the world could submit their add-ons to a central repository, from which add-ons could then be installed into Orange.

It would be good to try to integrate this with existing technologies and portals (Bitbucket, GitHub, Python Package Index).

Useful skills: Python (along with distutils, python package directory structure rules and packaging skills).

Level from 1 (beginner) to 5 (professional): 3

Possible mentors: Matija, Mitar

Widgets in separate processes

Widgets in Orange Canvas currently run in a single process. As they are independent given their inputs, they could frequently work in parallel (in a data-flow manner). The objective of this task would be to modify Orange Canvas so that each widget runs in its own process.

It would also be useful to separate the GUI thread from the main computation of widgets. Currently we use just one thread for everything (the GUI thread), and while a widget is working it has to call back into the GUI repeatedly to keep it responsive. Separating the two would make the code cleaner.
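The separation of computation from the GUI thread can be sketched without Qt, using a worker thread that reports back through a queue. This is a simplified illustration with hypothetical function names. In the sketch the main thread simply blocks on the queue, whereas a real widget would consume the messages from a timer callback or a signal handler so that the event loop stays free.

```python
import queue
import threading


def long_computation(n_steps, events):
    """Run in a worker thread and report progress through a queue."""
    total = 0
    for step in range(1, n_steps + 1):
        total += step * step          # stand-in for real work
        events.put(("progress", step / n_steps))
    events.put(("done", total))


def run_with_responsive_loop(n_steps):
    """Start the worker, then handle its messages from the main thread."""
    events = queue.Queue()
    worker = threading.Thread(target=long_computation, args=(n_steps, events))
    worker.start()
    progress_seen = []
    while True:
        # Stand-in for the GUI side: each message would update a
        # progress bar or deliver the final result to the widget.
        kind, value = events.get()
        if kind == "done":
            worker.join()
            return value, progress_seen
        progress_seen.append(value)


result, progress = run_with_responsive_loop(4)
print(result, progress)
```

The worker never touches the GUI directly, which removes the need for the computation code to call back into it.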

Useful skills: Python programming with multiple processes and threads. Qt and PyQt experience. Program design.

Level from 1 (beginner) to 5 (professional): 5

Possible mentors: Marko

Animations in Orange

Data visualization plays a very important role in understanding relationships in the data. Unfortunately, it is usually limited to two dimensions (e.g. a scatter plot), although additional information about the data can be presented through different colors, sizes and shapes of the points. There can, however, be additional variables in the data (e.g. time) that have a strong influence on the scatter plot. As in Gapminder, time can be used as an "animation" variable. One could play animations and see how the scatter plot changes over time (or over any other continuous variable from the data).
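The data side of such an animation amounts to grouping rows by the animation variable and drawing one scatter plot per group. The following minimal sketch shows only that grouping. The function name and the sample rows are made up, and a truly continuous variable would first need to be binned into intervals.

```python
from collections import defaultdict


def animation_frames(rows, x, y, time):
    """Group rows by the animation variable and yield one frame per value.

    Each frame is (time_value, [(x_value, y_value), ...]), ordered by time.
    """
    frames = defaultdict(list)
    for row in rows:
        frames[row[time]].append((row[x], row[y]))
    for t in sorted(frames):
        yield t, frames[t]


# Made-up data: two countries observed in two years.
rows = [
    {"country": "A", "year": 2000, "income": 1.0, "life": 60},
    {"country": "B", "year": 2000, "income": 2.0, "life": 65},
    {"country": "A", "year": 2010, "income": 1.5, "life": 63},
    {"country": "B", "year": 2010, "income": 2.5, "life": 70},
]

for year, points in animation_frames(rows, x="income", y="life", time="year"):
    print(year, points)
```

A widget would redraw the scatter plot with each frame's points in turn, at a speed chosen by the user.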

Useful skills: Python. Widgets programming.

Level from 1 (beginner) to 5 (professional): 4

Possible mentors: Blaž

dictyExpress

dictyExpress is an interactive, web-based exploratory data analytics application for the analysis of over 1,000 Dictyostelium gene expression experiments. The app is essentially a web interface to Orange algorithms that run in the cloud. We are porting the obsolete Flash user interface to HTML5/JavaScript. This is a highly multidisciplinary project that requires a wide range of skills.

Useful skills: JavaScript, HTML5, CSS3, AngularJS, Bootstrap, Python, Django, basic bioinformatics knowledge.

Possible mentors: Miha