Orange Forum • View topic - Orange.data.Table for arff import issues

Orange.data.Table for arff import issues

Postby joelyoung » Fri Mar 22, 2013 1:39

I've found some issues with ARFF import for sparse data.

If a sparse row has extra whitespace, such as:

{ 0 3, 4 5 }

the import fails. This is fixable by adding a strip() call in io.py. If some of the attributes are string-valued, though, the import throws an exception. Since Orange supports string data types, is there a particular reason not to support string attributes on import?

Would a patch fixing these be accepted?

Joel
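
The whitespace failure described in this post can be reproduced outside Orange. The following is a minimal standalone sketch, not the actual io.py code; the function name `parse_sparse_row` is hypothetical and only illustrates why stripping each index/value pair helps:

```python
def parse_sparse_row(line):
    """Parse a sparse ARFF data row such as '{ 0 3, 4 5 }' into {index: value}.

    Hypothetical helper for illustration only; it is not part of Orange.
    """
    body = line.strip()[1:-1]  # drop the surrounding braces
    row = {}
    for item in body.split(","):
        # strip() removes padding around each "index value" pair
        parts = item.strip().split(" ")
        if len(parts) != 2:
            raise ValueError("malformed sparse entry: %r" % item)
        row[int(parts[0])] = parts[1]
    return row


# Without strip(), the leading space yields an empty first field,
# so the length check sees three parts instead of two.
print(" 0 3".split(" "))                 # ['', '0', '3']
print(parse_sparse_row("{ 0 3, 4 5 }"))  # {0: '3', 4: '5'}
```

Note that splitting on a single space still rejects repeated spaces between an index and its value; calling `split()` with no argument would tolerate those as well.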

Re: Orange.data.Table for arff import issues

Postby Ales » Fri Mar 22, 2013 19:07

joelyoung wrote: Would a patch fixing these be accepted?

Yes, of course.

[PATCH] Orange.data.Table for arff import issues

Postby joelyoung » Sat Mar 23, 2013 2:23

--- io.py   2013-03-14 03:53:26.000000000 -0700
+++ new/io.py   2013-03-22 18:16:38.028372262 -0700
@@ -142,7 +142,7 @@
                 dd = x[1:-1]
                 dd = dd.split(',')
                 for xs in dd:
-                    y = xs.split(" ")
+                    y = xs.strip().split(" ")
                     if len(y) <> 2:
                         raise ValueError("the format of the data is error")
                     r[int(y[0])] = y[1]
@@ -194,7 +194,15 @@
                     a, s = make(atn, Orange.feature.Type.Discrete, vals, [], create_on_new)
                 else:
                     # real...
-                    a, s = make(atn, Orange.feature.Type.Continuous, [], [], create_on_new)
+                    attribute_type = x.split(' ')[-1].strip().lower()
+                   
+                    if   attribute_type == "numeric":
+                        a, s = make(atn, Orange.feature.Type.Continuous, [], [], create_on_new)
+                    elif attribute_type == "string":
+                        a, s = make(atn, Orange.feature.Type.String, [], [], create_on_new)
+                    else:
+                        raise ValueError("Invalid arff type in: [%s]"%((x)))
+
 
                 attributes.append(a)
                 attributeLoadStatus.append(s)


Here is a first stab at fixing up the arff input. I'm not sure if it would be better to raise on an invalid arff type or to bring it in as Unknown.

Suggestions?

joel

Re: Orange.data.Table for arff import issues

Postby Ales » Mon Mar 25, 2013 14:27

joelyoung wrote: I'm not sure if it would be better to raise on an invalid arff type or to bring it in as Unknown.

Suggestions?

I think it would be best to raise a warning and import the data as a String variable.

Re: Orange.data.Table for arff import issues

Postby joelyoung » Mon Mar 25, 2013 18:04

Sounds good re adding as string. I'll look into how to raise a warning that doesn't terminate the processing.

Re: Orange.data.Table for arff import issues

Postby Ales » Mon Mar 25, 2013 18:25

joelyoung wrote: Sounds good re adding as string. I'll look into how to raise a warning that doesn't terminate the processing.

Use the warnings module from the Python standard library.

P.S. Specifically use the 'UserWarning' category.
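
A minimal sketch of that suggestion, assuming a fallback to a string type for unrecognised ARFF types. The helper `resolve_attribute_type` and the labels it returns are hypothetical placeholders, not Orange's API; only `warnings.warn` and `UserWarning` come from the standard library:

```python
import warnings

# Placeholder labels for illustration; the real code would map to Orange feature types.
KNOWN_TYPES = {"numeric": "continuous", "real": "continuous", "string": "string"}


def resolve_attribute_type(declared, line_number):
    """Map a declared ARFF type to a label, falling back to 'string' with a warning."""
    key = declared.strip().lower()
    if key in KNOWN_TYPES:
        return KNOWN_TYPES[key]
    # warnings.warn reports the problem without interrupting the import;
    # stacklevel=2 attributes the warning to the caller of this helper.
    warnings.warn(
        "Invalid arff type %r on line %d; using 'string'." % (declared, line_number),
        UserWarning,
        stacklevel=2,
    )
    return "string"


print(resolve_attribute_type("NUMERIC", 3))  # continuous
print(resolve_attribute_type("date", 4))     # string (a UserWarning is also emitted)
```

Unlike raising ValueError, the warning lets the loader continue, and callers who prefer a hard failure can turn warnings into errors with `warnings.simplefilter("error")`.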

Re: Orange.data.Table for arff import issues

Postby joelyoung » Wed Apr 24, 2013 1:10

Below is an updated patch that issues a UserWarning and does a few other clean-ups so that ARFF files are read more reliably. It can also read attributes whose values are single-quoted strings.

The patch isn't the be-all and end-all of parsers, as the ARFF format is very flexible, but it is better than the previous one.

--- io.py.orig   2013-03-14 03:53:26.000000000 -0700
+++ io.py   2013-04-23 17:08:27.329868120 -0700
@@ -118,6 +118,9 @@
 
 def loadARFF_Weka(filename, create_on_new=MakeStatus.Incompatible, **kwargs):
     """Return class:`Orange.data.Table` containing data from file in Weka ARFF format"""
+   
+    import csv
+
     if not os.path.exists(filename) and os.path.exists(filename + ".arff"):
         filename = filename + ".arff"
     f = open(filename, 'r')
@@ -128,7 +131,9 @@
     name = ''
     state = 0 # header
     data = []
+    line_number = 0
     for l in f.readlines():
+        line_number += 1
         l = l.rstrip("\n") # strip \n
         l = l.replace('\t', ' ') # get rid of tabs
         x = l.split('%')[0] # strip comments
@@ -140,17 +145,22 @@
             if x[0] == '{':#sparse data format, begin with '{', ends with '}'
                 r = [None] * len(attributes)
                 dd = x[1:-1]
-                dd = dd.split(',')
-                for xs in dd:
-                    y = xs.split(" ")
-                    if len(y) <> 2:
-                        raise ValueError("the format of the data is error")
-                    r[int(y[0])] = y[1]
+                terms = csv.reader([dd.strip()],quotechar="'",skipinitialspace=True,quoting=csv.QUOTE_NONE)
+                items = [z for z in [y for y in terms][0]]
+                for xs in items:
+                    csv_parts = csv.reader([xs],delimiter=" ", quotechar="'")
+                    parts = [z for z in [y for y in csv_parts][0]]
+           
+                    if len(parts) <> 2:
+                        raise ValueError("the format of the data is in error on line %d"%line_number)
+                    r[int(parts[0])] = parts[1]
                 data.append(r)
             else:#normal data format, split by ','
-                dd = x.split(',')
+                terms = csv.reader([x.strip()],quotechar="'",skipinitialspace=True,quoting=csv.QUOTE_NONE)
+                items = [z for z in [y for y in terms][0]]
+                print items
                 r = []
-                for xs in dd:
+                for xs in items:
                     y = xs.strip(" ")
                     if len(y) > 0:
                         if y[0] == "'" or y[0] == '"':
@@ -194,7 +204,17 @@
                     a, s = make(atn, Orange.feature.Type.Discrete, vals, [], create_on_new)
                 else:
                     # real...
-                    a, s = make(atn, Orange.feature.Type.Continuous, [], [], create_on_new)
+                    attribute_type = x.split(' ')[-1].strip().lower()
+                   
+                    if   attribute_type == "numeric" or \
+                         attribute_type == "real":
+                        a, s = make(atn, Orange.feature.Type.Continuous, [], [], create_on_new)
+                    elif attribute_type == "string":
+                        a, s = make(atn, Orange.feature.Type.String, [], [], create_on_new)
+                    else:
+                        a, s = make(atn, Orange.feature.Type.String, [], [], create_on_new)
+                        warnings.warn("Invalid arff type on line %d, attribute [%s].  Using 'string' attribute type."%(line_number,x), UserWarning, stacklevel=2)
+
 
                 attributes.append(a)
                 attributeLoadStatus.append(s)


Joel
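
The csv-based splitting used in the patch can be tried in isolation. The sketch below is a hypothetical helper (`split_arff_row` is not a name from io.py) and differs from the patch in one respect: it leaves `quoting` at its default, because the csv documentation describes `QUOTE_NONE` as telling the reader to do no special processing of quote characters, which would leave a quoted comma acting as a delimiter:

```python
import csv


def split_arff_row(line):
    """Split one dense ARFF data row into values, honouring single quotes.

    Hypothetical helper for illustration; not the code from the patch.
    """
    # quotechar="'" treats single-quoted fields as one value;
    # skipinitialspace=True ignores spaces that follow a comma.
    reader = csv.reader([line.strip()], quotechar="'", skipinitialspace=True)
    return [value.strip() for value in next(reader)]


print(split_arff_row("5.1, 'hello, world', yes"))  # ['5.1', 'hello, world', 'yes']
```

This handles only the quoting cases shown here; it is not a complete ARFF parser.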

Re: Orange.data.Table for arff import issues

Postby joelyoung » Thu Apr 25, 2013 17:40

Another issue I've noticed is that when importing from sparse ARFF files, unspecified values are marked as "?" rather than assigned the value "0", as the Weka ARFF format specification requires.

Unspecified values in sparse ARFF are explicitly not unknown; they are 0 (zero).

Joel
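
A minimal sketch of the behaviour described in this post, assuming the sparse row has already been parsed into an {index: value} mapping. The function `densify_sparse_row` is hypothetical and is not how io.py is structured:

```python
def densify_sparse_row(sparse, n_attributes, default="0"):
    """Expand {index: value} into a full row, filling unspecified slots with `default`.

    Hypothetical helper for illustration only; it is not part of Orange.
    """
    # Start from zeros rather than None, so omitted entries are not read as unknown.
    row = [default] * n_attributes
    for index, value in sparse.items():
        if not 0 <= index < n_attributes:
            raise ValueError("attribute index %d out of range" % index)
        row[index] = value
    return row


print(densify_sparse_row({0: "3", 4: "5"}, 6))  # ['3', '0', '0', '0', '5', '0']
```

In terms of the patch above, this corresponds to initialising the row with "0" instead of `[None] * len(attributes)`.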

