Orange

orngSQL: Orange's Interface to SQL databases

The orngSQL module provides access to relational databases from Orange. It currently supports MySQL through MySQL for Python and Postgres through Psycopg 2. Support for other Python DB-API 2.0 compliant databases can and will be added upon request.

Unlike OrngMySQL, orngSQL does not support the automatic detection of discrete attributes - this was intentional, since the required "select distinct" can be too slow on large databases. The support for writing data back into the database exists, but is rudimentary.

SQLReader

SQLReader establishes a connection with a database and provides the methods needed to fetch the data from the database into Orange.

Attributes

className
The name of the class variable
metaNames
The list of the names of meta attributes
discreteNames
The list of the names of the variables whose type is to be discrete
query
The SQL query to be executed
domain
The Orange domain used.

Methods

execute(statement [,domain=None])
Executes SQL select statement statement. If the domain argument is provided, the new domain is used.
update()
Executes a pending SQL query.
data()
returns The Orange ExampleTable produced by the last executed query.
connect(uri)
Connects to the database, specified in the uri.
disconnect()
Disconnects from the database.

SQLWriter establishes a connection with a database and provides the methods needed to create an appropriate table in the database and/or write the data from an ExampleTable into the database.

Methods

write(self, data, table, renameDict = None)
Writes the data into the table. If renameDict is provided, the names used are remapped - the orange attribute "X" is written into the database column renameDict["X"] of the table.
create(self, data, table, renameDict = None, typeDict = None)
Creates the required SQL table, then writes the data into it. If typeDict is provided, the SQL type of the orange attribute "X" is set to typeDict["X"].
connect(uri)
Connects to the database, specified in the uri.
disconnect()
Disconnects from the database.

loadSQL(filename, dontCheckStored = False, domain = None)Executes the sql query stored in the file under filename. This method allows the standard file reading functions with .sql files.

saveSQL Does nothing. This function will be implemented as soon as someone suggests a good way to do so.

Attribute Names and Types

The rows returned by an SQL query have to be converted into orange Examples. Each column in a row has to be converted into a certain attribute type. The following conversions between SQL and Orange types are used:

  • STRING and DATETIME attributes are converted to Orange strings
  • The attributes whose names are contained in discreteNames are converted to Orange discrete attributes.
  • The attributes whose names are listed in metaNames
  • Other attributes are converted to continuous Orange attributes
  • The attribute in className is set as the class attribute. If no className is set, the column with the name "class" in the returned SQL query is set as the class attribute. If no such column exists, the last column is set as the class attribute.

Orange domain handling

In Orange, attributes in different ExampleTables can represent different things, even if they share the same names. To remedy this, Orange employs domains.. To make the reuse and creation of new domains easier, Orange provides the DomainDepot.

When constructing an ExampleTable of results, sqlReader also has to somehow create a domain. This can be done in three ways:

  • By default, the domain is constructed automatically, based on className, metaNames, and discreteNames.
  • The domain can be provided as an argument to execute().
  • The domain can be set explicitly. The set domain is then used with the query, affecting the attribute types of the ExampleTable, returned by data()

Examples

The examples below were tested with the postgress adapter. To use them, you will probably have to modify them according to your database setup. In the example, the name of the database is assumed to be "test", the username "user" and the password "somepass"

We start with setting-up a small data table in the database to be used in our examples:

bus-postgres.sql (uses bus.txt)

DROP TABLE bus; CREATE TABLE bus (id varchar(5), line integer, daytime varchar, temp float, weather varchar, arrival varchar); \COPY "bus" FROM 'bus.txt' USING DELIMITERS ' ' SELECT * from bus;

bus-mysql.sql (uses bus.txt)

DROP TABLE bus; CREATE TABLE bus (id varchar(5), line integer, daytime varchar, temp float, weather varchar, arrival varchar); LOAD DATA LOCAL INFILE 'bus.txt' INTO TABLE bus; SELECT * from bus;

If everything goes well, running the script from shell by something like

mysql -u user -p < bus.sql or psql -u user -p < bus-postgres.sql

should produce something along the lines of:

id line daytime temp weather arrival 1 10 morning 10 sunny late 2 11 morning 13 rainy late 3 9 morning 15 rainy late 4 10 evening 25 sunny on-time 5 9 evening 29 sunny on-time 6 11 morning 26 sunny late 7 9 evening 9 rainy on-time 8 9 midday 20 rainy late 9 11 midday 21 sunny late 10 10 evening 5 rainy on-time 11 10 midday 8 rainy late 12 9 morning 5 rainy on-time

Let us now write the script that reads some data from this table from the database and produces an ExampleTable.

sql-new1.py

Running this script reports on four data instances and lists the attributes (with types) in new Orange data table:

['1', '10', 'morning', 10.000, 'sunny', 'late'] ['4', '10', 'evening', 25.000, 'sunny', 'on-time'] ['10', '10', 'evening', 5.000, 'rainy', 'on-time'] ['11', '10', 'midday', 8.000, 'rainy', 'late'] StringVariable 'id' EnumVariable 'line' EnumVariable 'daytime' FloatVariable 'temp' EnumVariable 'weather' EnumVariable 'arrival'

Now let us suppose that some of the attributes should actually be treated as discrete variables. The arrival column will now be explicitly set as the class variable.

sql-new2.py

import orange, orngMySQL t = orngMySQL.Connect('localhost','root','','test') data = t.query("SELECT * FROM busclass WHERE line='10'") for x in data: print x print for a in data.domain.attributes: print a print 'Class:', data.domain.classVar

By running it, we get:

['10', 'morning', 10.000, 'sunny', 'late'], {"id":'1'} ['10', 'evening', 25.000, 'sunny', 'on-time'], {"id":'4'} ['10', 'evening', 5.000, 'rainy', 'on-time'], {"id":'10'} ['10', 'midday', 8.000, 'rainy', 'late'], {"id":'11'} EnumVariable 'line' EnumVariable 'daytime' FloatVariable 'temp' EnumVariable 'weather' Class: EnumVariable 'arrival'

For our final examples, we load Iris data set in Orange, write it to a data base, and read back only the data records whose sepal length is below 5 mm. By default, all discrete and string orange attributes are treated as VARCHARs and all continuous are treated as FLOATs.

sql-new4.py (uses iris.tab)

# Description: Writes a data set to and reads from an SQL database # Category: file formats # Classes: ExampleTable, orngSQL.SQLReader, orngSQL.SQLWriter # Uses: iris.tab # Referenced: orngSQL.htm import orange, orngSQL, orngTree data = orange.ExampleTable("iris") print "Input data domain:" for a in data.domain.variables: print a r = orngSQL.SQLReader('mysql://user:somepass@localhost/test') w = orngSQL.SQLWriter('mysql://user:somepass@localhost/test') t.write('iris', data, overwrite=True) sel = t.query("SELECT petal_width, petal_length FROM iris WHERE sepal_length<5.0") print "\n%d instances returned" % len(sel) print "Output data domain:" for a in sel.domain.variables: print a

The output of the script above is:

Input data domain: FloatVariable 'sepal length' FloatVariable 'sepal width' FloatVariable 'petal length' FloatVariable 'petal width' EnumVariable 'iris' 22 instances returned Output data domain: FloatVariable 'petal_width' FloatVariable 'petal_length'

Usually, one needs more control over the sql names and types. Here's the same example as above, with some names and types set explicitly:

sql-new5.py(uses iris.tab)

# Description: Writes a data set to and reads from an SQL database # Category: file formats # Classes: ExampleTable, orngSQL.SQLReader, orngSQL.SQLWriter # Uses: iris.tab # Referenced: orngSQL.htm import orange, orngSQL, orngTree data = orange.ExampleTable("iris") print "Input data domain:" for a in data.domain.variables: print a r = orngSQL.SQLReader('mysql://user:somepass@localhost/test') w = orngSQL.SQLWriter('mysql://user:somepass@localhost/test') # the following line only works with mysql because it uses the enum type. w.create('iris', data, renameDict = {'sepal length':'seplen', 'sepal width':'sepwidth', 'petal length':'petlen', 'petal width':'petwidth'}, typeDict = {'iris':"""enum('Iris-setosa', 'Iris-versicolor', 'Iris-virginica')"""}) r.execute("SELECT petwidth, petlen FROM iris WHERE seplen<5.0;") data = r.data() print "\n%d instances returned" % len(data) print "Output data domain:" for a in data.domain.variables: print a

And the output:

Input data domain: FloatVariable 'sepal length' FloatVariable 'sepal width' FloatVariable 'petal length' FloatVariable 'petal width' EnumVariable 'iris' CREATE TABLE "iris" ( "seplen" DOUBLE, "sepwidth" DOUBLE, "petlen" DOUBLE, "petwidth" DOUBLE, "iris" enum('Iris-setosa', 'Iris-versicolor', 'Iris-virginica') ); 22 instances returned Output data domain: FloatVariable 'petwidth' FloatVariable 'petlen'

When accessing data, stored in a relational database, SQL queries are often reused. Unfortunately, an sql query is not sufficient to retrieve the data. Before executing the query, one must first connect to a database. It is also impossible to infer important properties, such as the desired class variable, from the retrieved data.

The simplest method of accessing data, stored in a relational database, from Orange, in most cases, is through .sql files with special comments. The examples listed below should be more or less self-explanatory. The statements after the keywords uri, discrete, class and metas are actually evaluated by the python interpeter. The returned values are then used appropriately.

For now, there is no write support for .sql files. To transfer data from Orange back into an SQL database, the orng should be used directly.

sql-new-embed.sql

--orng uri 'mysql://user:somepass@localhost/test' --orng discrete ['registration', 'num', 'time of day', 'arrival'] --orng meta ['weather', 'arrival', 'time'] --orng class ['arrival'] SELECT "id" as registration, line as num, daytime as "time of day", temp as temperature, weather, arrival FROM bus WHERE line='10';

sql-new-embed.py

(uses sql-new-embed.sql) # Description: Reads data from a database # Category: file formats # Classes: orngSQL.SQLReader # Referenced: orngSQL.htm import orange, orngSQLFile orange.registerFileType("SQL", orngSQLFile.loadSQL, None, ".sql") data = orange.ExampleTable('sql-new-embed.sql') for x in data: print x print for a in data.domain.attributes: print a

The expected output:

(('registration', 253, 2, 2, 2, 31, 0), ('num', 254, 2, 2, 2, 0, 1), ('time of day', 254, 7, 7, 7, 0, 1), ('temperature', 4, 2, 12, 12, 31, 1), ('weather', 254, 5, 5, 5, 0, 1), ('arrival', 254, 7, 7, 7, 0, 1)) (('registration', 253, 2, 2, 2, 31, 0), ('num', 254, 2, 2, 2, 0, 1), ('time of day', 254, 7, 7, 7, 0, 1), ('temperature', 4, 2, 12, 12, 31, 1), ('weather', 254, 5, 5, 5, 0, 1), ('arrival', 254, 7, 7, 7, 0, 1)) ['id', '10', 'morning', 10.0], {"weather":'sunny', "arrival":'late'} ['id', '10', 'evening', 25.0], {"weather":'sunny', "arrival":'on-time'} ['id', '10', 'evening', 5.0], {"weather":'rainy', "arrival":'on-time'} ['id', '10', 'midday', 8.0], {"weather":'rainy', "arrival":'late'} EnumVariable 'registration' EnumVariable 'num' EnumVariable 'time of day' FloatVariable 'temperature'