Version 27 (modified by gw0, 3 years ago)

Google Summer of Code Ideas

Here is a list of project ideas that we think would be interesting and useful to work on for Orange during the Google Summer of Code program. You are welcome to propose your own ideas as well, as long as they are connected with Orange, data mining, machine learning, artificial intelligence in general, bioinformatics, or other fields we are interested in (or that you can get us interested in).

Ideas are listed in no particular order.

Repository for add-ons

Orange supports add-ons, which can add new scripting features and new widgets (GUI). Currently this feature is underused: only a few internally developed add-ons exist. It would be great to open it up so that contributors around the world could submit their add-ons to a central repository, from which the add-ons could then be installed into Orange. This could take the form of a web portal, perhaps something along the lines of Trac Hacks. The portal should encourage collaboration, code exchange, help and community. This would also improve collaboration within the global data mining and machine learning community.

It would be good to integrate this with existing technologies and portals (Bitbucket, GitHub, Python Package Index).

Useful skills: Python. Web programming experience (suggested technologies are Django and jQuery).

Level from 1 (beginner) to 5 (professional): 3

Bridge between Orange and R

R contains many great methods and tools that would also be very useful in Orange. To avoid duplicating work, it would be great to be able to use those methods and tools directly from Orange, so that they do not have to be reimplemented.

The idea is to research possibilities for this and then implement a future-proof bridge between Orange and R.

Useful skills: Python. C/C++. Experience with R. Experience with program-to-program interfaces.

Level from 1 (beginner) to 5 (professional): 4

Widgets in separate processes

Widgets in Orange Canvas currently run in a single process. As they are independent given their inputs, they could frequently work in parallel (in a data-flow manner). The objective of this task would be to modify Orange Canvas so that each widget runs in its own process.

It would also be useful to separate the GUI thread from the main payload computation of widgets. Currently we use a single thread (the GUI thread) for everything, so while a widget is working it has to call back into the GUI repeatedly to keep it responsive. Separating the two would make the code cleaner.
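The separation of GUI and payload could look roughly like the following minimal sketch. It uses only the Python standard library rather than Qt, and the names heavy_computation and run_in_background are hypothetical stand-ins, not part of Orange. The worker thread posts progress messages to a queue, and the loop that drains the queue plays the role that the GUI event loop would play in a real widget.

```python
import queue
import threading


def heavy_computation(data, messages):
    """Hypothetical stand-in for a widget's payload computation.

    Instead of calling back into the GUI, it posts messages to a queue.
    """
    total = 0
    for i, value in enumerate(data, start=1):
        total += value * value
        messages.put(("progress", i / len(data)))
    messages.put(("done", total))


def run_in_background(data):
    """Run the payload in a worker thread and drain its messages here.

    In a real widget the draining would be done by the GUI event loop
    (for example from a timer), not by a blocking loop like this one.
    """
    messages = queue.Queue()
    worker = threading.Thread(target=heavy_computation, args=(data, messages))
    worker.start()

    result = None
    updates = 0
    while result is None:
        kind, payload = messages.get()  # blocks until the worker posts
        if kind == "progress":
            updates += 1  # a GUI would update a progress bar here
        else:
            result = payload
    worker.join()
    return result, updates
```

With this shape the payload never touches GUI objects, which is the property the project would need to establish; whether threads or separate processes carry the payload is a design decision left open here.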

Useful skills: Python programming with multiple processes and threads. Qt and PyQt experience. Program design.

Level from 1 (beginner) to 5 (professional): 5

Support for parallel computation for scripting/backend

Another idea on this list covers running the GUI widgets in parallel or in separate processes. This idea is instead about the scripting (backend) part of Orange supporting (semi-)automatic parallelization across processes, and possibly across different computers. Cross-validation with multiple folds is a simple example of an easily parallelized technique: each fold can be computed independently and the results then combined into the final result.

It would be good to analyze such opportunities for parallelization, find what they have in common, and perhaps devise a small helper library (possibly a wrapper for an existing grid computing system, such as Xgrid) that makes code run in parallel when such an environment is available and run normally when it is not. Then, of course, move as many of the existing implementations as possible to this new parallelization support.
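As a rough illustration of such a helper, the sketch below evaluates cross-validation folds either serially or through an executor, if one is supplied. Everything here is an assumption made for the example: the function names are hypothetical, the "learner" is a trivial majority-class predictor, and the executor is any object with a map method in the style of Python's concurrent.futures. It does not use Orange's API.

```python
from collections import Counter
from statistics import mean


def make_folds(n_examples, n_folds):
    """Split indices 0..n_examples-1 into n_folds (train, test) pairs."""
    indices = list(range(n_examples))
    folds = []
    for k in range(n_folds):
        test = indices[k::n_folds]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        folds.append((train, test))
    return folds


def evaluate_fold(task):
    """Score one fold independently of all the others.

    The learner is a hypothetical majority-class predictor; a real
    backend would train and test an actual model here.
    """
    labels, train, test = task
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return mean(1.0 if labels[i] == majority else 0.0 for i in test)


def cross_validate(labels, n_folds, executor=None):
    """Average fold accuracy; parallel if an executor is given."""
    tasks = [(labels, train, test)
             for train, test in make_folds(len(labels), n_folds)]
    if executor is None:
        scores = [evaluate_fold(task) for task in tasks]  # run normally
    else:
        scores = list(executor.map(evaluate_fold, tasks))  # run in parallel
    return mean(scores)
```

Passing a concurrent.futures.ThreadPoolExecutor or ProcessPoolExecutor as executor should give the same result as the serial path, since each fold depends only on its own task tuple. A wrapper for a grid system would only need to offer the same map interface.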

Useful skills: Python. Grid computing experience.

Level from 1 (beginner) to 5 (professional): 4

Benchmarking and optimizing Orange

It would be useful to test and benchmark different aspects of Orange and find bottlenecks, and then to propose and implement solutions for them. Orange implements various algorithms, and some implementations are better than others. It would be useful to compare our implementations with those in other packages to see how they perform and whether they should be improved.
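A comparison of two implementations can start from something as small as the sketch below, which uses the standard timeit module. The benchmark helper and the two sum-of-squares functions are hypothetical examples, not Orange code; they only show the pattern of checking that two implementations agree before timing them.

```python
import timeit


def benchmark(func, *args, repeat=5, number=10):
    """Return the best observed time per call, in seconds.

    Taking the minimum over several repeats reduces noise from other
    activity on the machine.
    """
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def sum_squares_loop(values):
    """Straightforward implementation with an explicit loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values):
    """Alternative implementation using the built-in sum."""
    return sum(v * v for v in values)


if __name__ == "__main__":
    data = list(range(100000))
    # Check correctness first; timing a wrong implementation is pointless.
    assert sum_squares_loop(data) == sum_squares_builtin(data)
    for func in (sum_squares_loop, sum_squares_builtin):
        print(func.__name__, benchmark(func, data))
```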

Useful skills: Experience with testing and benchmarking software. Experience with common patterns which make programs run slowly.

Level from 1 (beginner) to 5 (professional): 3

Anova regression

Implement Anova regression, which would support arbitrary models, similar to the R implementation.

Useful skills: Python. The candidate should be familiar with statistics and computation with matrices (numpy).

Level from 1 (beginner) to 5 (professional): 3

More state-of-the-art regression algorithms

Research state-of-the-art regression algorithms found in other packages (R, Weka, ...) and choose the most important representatives not yet available in Orange. Then reimplement them in Orange. Another option is to add only wrappers for external libraries, if they are open source and compatible enough.

Useful skills: Python. Probably also statistics and computation with matrices (numpy).

Level from 1 (beginner) to 5 (professional): 4

Multi-label classification

Orange lacks support for multi-label learning and classification at the data-structure, algorithmic and GUI levels. Update the structures, implement at least a few algorithms, adapt the evaluation methods and add GUI support. A neat repository of literature on multi-label learning is ML&KD's web page Learning from Multi-Label Data.
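To make the algorithmic and evaluation parts more concrete, here is a minimal sketch of two common building blocks: the binary relevance transformation (one binary problem per label) and Hamming loss as an evaluation measure. These are chosen only as illustrations; the function names and the representation of label sets as Python sets are assumptions, not a proposed Orange design.

```python
def to_binary_problems(label_sets, all_labels):
    """Binary relevance: build one 0/1 target vector per label."""
    return {label: [int(label in labels) for labels in label_sets]
            for label in all_labels}


def hamming_loss(true_label_sets, predicted_label_sets, all_labels):
    """Fraction of (example, label) pairs that are predicted wrongly."""
    errors = sum(len(set(true) ^ set(predicted))
                 for true, predicted
                 in zip(true_label_sets, predicted_label_sets))
    return errors / (len(true_label_sets) * len(all_labels))
```

Each binary problem produced by to_binary_problems could then be handed to any existing single-label learner, which is one way to reuse algorithms that Orange already has.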

Useful skills: Python. C/C++. A bit of machine learning.

Level from 1 (beginner) to 5 (professional): 4

Improve k-Nearest Neighbors

For n training examples and m test examples with d attributes, standard kNN computes the distance from each test example to all n training examples. The time complexity is thus O(d*n*m). With a smarter implementation and/or under special conditions (e.g. Euclidean distance) this can be improved.
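One well-known option for Euclidean distance is a k-d tree, which lets a query skip whole regions of the training set. The following is a minimal pure-Python sketch under that assumption; it is not Orange's implementation, the function names are hypothetical, and a production version would more likely be written in C/C++. The benefit is typically largest for low-dimensional data and shrinks as d grows.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(tree, query, k):
    """Return the k training points closest to query, nearest first."""
    heap = []  # max-heap of the best k so far, stored as (-distance, point)

    def visit(node):
        if node is None:
            return
        point, axis, left, right = node
        dist = squared_distance(point, query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, point))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, point))
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # The far side can only contain a better point if the splitting
        # plane is closer than the current k-th best distance.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(tree)
    return [point for _, point in sorted(heap, reverse=True)]
```

The tree is built once for the training set and then reused for all m test examples, so the build cost is shared across queries.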

Useful skills: Python and C/C++.

Level from 1 (beginner) to 5 (professional): 3

Add built-in attribute weighting support

Traditionally, learning algorithms do not support attribute weighting and treat all attributes as if they were equally important, but this is not the case in real data sets. It is known that variants of algorithms that use this additional data usually perform better. Because Orange currently has no functionality for this kind of metadata, none of its algorithms support it out of the box. It would be great to add support for this (and possibly other) kinds of metadata to attributes and other data structures, and to extend various learning algorithms to use it. The second part will require some research into how to add this to each algorithm and, probably, how to generalize the solution.
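For a distance-based learner, one simple way to use attribute weights is to scale each attribute's contribution to the distance. The sketch below shows this for a small kNN classifier; the function names and the representation of weights as a plain sequence are assumptions made for illustration, and other algorithms would need their own ways of using the weights.

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Euclidean distance with a non-negative weight per attribute."""
    return math.sqrt(sum(w * (x - y) ** 2
                         for x, y, w in zip(a, b, weights)))


def knn_classify(train, query, k, weights):
    """Classify query by majority vote among its k nearest neighbors.

    train is a list of (attributes, label) pairs.
    """
    nearest = sorted(
        train,
        key=lambda example: weighted_distance(example[0], query, weights),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

For example, if the second attribute is known to be noise, giving it weight 0 can change the prediction:

```python
train = [((0.0, 0.0), "a"), ((1.0, 1.0), "a"),
         ((5.0, 8.0), "b"), ((6.0, 9.0), "b")]
print(knn_classify(train, (1.0, 8.0), 1, (1.0, 1.0)))  # equal weights
print(knn_classify(train, (1.0, 8.0), 1, (1.0, 0.0)))  # ignore attribute 2
```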

Useful skills: Python and C/C++. Understanding of machine learning algorithms.

Level from 1 (beginner) to 5 (professional): 4

Time-series analysis

Orange currently lacks any time-series analysis tools. It would be great to develop some basic tools for working with time series: reading, normalizing, basic pattern search, (auto-)correlation and similar basic techniques. Research what other similar applications support and propose which features would be useful to have as a basic set of tools.
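Two of the basic techniques mentioned above, normalization and autocorrelation, can be sketched in a few lines of standard-library Python. The function names and the choice of z-score normalization and the biased autocorrelation estimator are assumptions made for this example; the project itself would need to decide which variants to offer.

```python
from statistics import mean, pstdev


def normalize(series):
    """Z-score normalization: zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # a constant series has no spread
    return [(x - mu) / sigma for x in series]


def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (biased estimator)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mu = mean(series)
    variance_sum = sum((x - mu) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    covariance_sum = sum((series[t] - mu) * (series[t + lag] - mu)
                         for t in range(n - lag))
    return covariance_sum / variance_sum
```

A series that alternates between two values has a strongly negative autocorrelation at lag 1 and a strongly positive one at lag 2, which is the kind of periodic pattern such a tool could help reveal.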

Useful skills: Python. Data analysis experience. Digital signal processing experience could also help.

Level from 1 (beginner) to 5 (professional): 4

Documentation proofreading and improvement

We have quite a lot of documentation for different aspects of Orange (widgets, scripting...), and of course a lot of it is still missing, incomplete or could be improved. But some things could already be done with the existing documentation, such as checking the language itself (we are not native English speakers), proofreading, checking whether all examples (still) work, and finding new and/or better examples.

This could also be a good project if you would like to learn more about Orange, data mining and machine learning.

Useful skills: Proficiency in English (probably native speaker). Language/writing skills.

Level from 1 (beginner) to 5 (professional): 3

Improve build system

Currently we use a home-brew build system that is hard to maintain, hard to port to different platforms, and hard to package for installation. It would probably be good to move to a standardized approach that improves on this.

As a demonstration of the new build system, it would be great to prepare packages for common Linux distributions and to show how installation packages for Windows and Mac OS X can now be made more easily. It should also be possible to build normal Python packages (eggs and similar) with a common procedure.

Useful skills: Experience with cross-platform build systems and Python packaging.

Level from 1 (beginner) to 5 (professional): 4

Documentation screen casts

The current documentation of Orange widgets is static. There are about 100 widgets, all of them described with screen shots and text. This is a bit boring and often does not provide much help for novice users. We believe that this is where Orange could shine. It is a really simple system to use, but the current documentation fails to present it that way. Instead, we need fresh, informative screen casts, each no longer than 2 minutes, explaining what each widget does and how it can be integrated into a useful schema. The screen casts should probably feature a narrator who describes, in the first person, the problem that the widget solves, and then focuses on a schema with an example.

The screen casts would then be included in the documentation of widgets, but would also be available in each widget by clicking on the Help button. The project would in this way develop a tagged library of screen casts, probably published on YouTube.

Useful skills: English proficiency. Video and sound editing. Talent for staging. Some knowledge of data mining.

Level from 1 (beginner) to 5 (professional): 2

A social platform for Orange

Orange's visual programming environment can combine any of its more than 100 widgets into schemas that perform clustering, classification, visualization-based analysis, PCA, and many other tasks. A repository of typical schemas would be most welcome: novice users could choose from already defined schemas and analyze their own data with them, intermediate users could use the library to improve or augment their own schemas, and experienced users could store their inventions in the repository for others to use. Orange could also train a widget recommendation system from the schemas in the repository. The repository could feature tagging, liking, commenting, and everything else that a social platform can provide. Schemas in the repository could be described in text or video.

In summary, the project would develop a new social platform for data mining solutions, possibly relying on an existing solution (e.g. myExperiment) or crafting something new (and simpler?). The repository would feature web access, but its schemas should also be available from within Orange (browsing, uploading and downloading). Seamless integration of the repository and Orange is crucial to the success of this project.

Useful skills: Knowledge of how to develop a social platform (possibly Django). Python programming. Some knowledge of data mining/machine learning would help as well.

Level from 1 (beginner) to 5 (professional): 4.5