source: orange-bioinformatics/docs/reference/obiGEO.htm @ 1661:6c5f448f2563

Revision 1661:6c5f448f2563, 10.0 KB checked in by mitar, 2 years ago (diff)

Moved style CSS file.

Line 
1<html>
2<HEAD>
3<LINK REL=StyleSheet HREF="style.css" TYPE="text/css">
4<LINK REL=StyleSheet HREF="../style-print.css" TYPE="text/css" MEDIA=print></LINK>
5</HEAD>
6
7<BODY>
8<h1>obiGEO: an interface to NCBI's Gene Expression Omnibus</h1>
9
10<index name="NCBI">
11<index name="Gene Expression Omnibus">
12<index name="microarray data sets">
13
14<p>obiGEO provides an interface
15to <a href="http://www.ncbi.nlm.nih.gov/">NCBI</a>'s
16<a href="http://www.ncbi.nlm.nih.gov/geo/">Gene Expression Omnibus</a>
17repository. Currently, it only supports
18<a href="http://www.ncbi.nlm.nih.gov/sites/GDSbrowser">GEO
19DataSets</a> information querying and retrieval.</p>
20
21<h2>GDSInfo</h2>
22
23<p><INDEX name="classes/GDSInfo (in obiGEO)">GDSInfo is the class that
24    can be used to retrieve information about
25    <a href="http://www.ncbi.nlm.nih.gov/sites/GDSbrowser">GEO
26    DataSets</a>. The class reads an information file that either
27    already resides on the local computer or is automatically
28    retrieved from the Orange server. Note that using this class
29    does not access any of NCBI's servers directly.</p>
30
31<p class=section>Methods</p>
32<dl class=attributes>
33<dt>GDSInfo(force_update=False)</dt>
34<dd><p>Constructor returning an object with GEO DataSets
35  information. If <code>force_update</code> is set
36  to <code>True</code>, the constructor downloads the GEO DataSets
37  information file (gds_info.pickled) from the Orange server; otherwise,
38  it first checks whether a local copy exists. The returned object
39  behaves like a dictionary: the keys are GEO DataSets IDs, and each
40  value is itself a dictionary providing various information
41  about the particular data set.</p>
42
43<xmp class=code>>>> import obiGEO
44>>> info = obiGEO.GDSInfo()
45>>> info.keys()[:5]
46['GDS2526', 'GDS2524', 'GDS2525', 'GDS2522', 'GDS1618']
47>>> info['GDS2526']['title']
48'c-MYC depletion effect on carcinoma cell lines'
49>>> info['GDS2526']['platform_organism']
50'Homo sapiens'
51</xmp>
52</dd>
53</dl>
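<p>Because the object behaves like a dictionary, ordinary dictionary idioms can be used to query it. The sketch below does not call obiGEO: <code>sample_info</code> is a hand-made stand-in holding a few values quoted in this document (the GDS10 title is a placeholder), and <code>ids_for_organism</code> is a hypothetical helper, not part of the module.</p>

```python
# Hand-made stand-in for obiGEO.GDSInfo(); the real object is loaded from
# gds_info.pickled. Values are taken from this document, except the GDS10
# title, which is a placeholder.
sample_info = {
    "GDS2526": {"title": "c-MYC depletion effect on carcinoma cell lines",
                "platform_organism": "Homo sapiens"},
    "GDS1676": {"title": "T cell leukemia cell response to human "
                         "herpesvirus 6 infection: time course",
                "platform_organism": "Homo sapiens"},
    "GDS10": {"title": "(placeholder title)",
              "platform_organism": "Mus musculus"},
}


def ids_for_organism(info, organism):
    """Return the sorted IDs of data sets whose platform organism matches."""
    return sorted(gds_id for gds_id, entry in info.items()
                  if entry["platform_organism"] == organism)


human_ids = ids_for_organism(sample_info, "Homo sapiens")
print(", ".join(human_ids))
```

<p>The same function should work unchanged on a real <code>GDSInfo</code> object, assuming every entry carries a <code>platform_organism</code> key as in the examples above.</p>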
54
55<h2>GDS</h2>
56
57<p><INDEX name="classes/GDS (in obiGEO)">GDS is a class that
58    provides methods for retrieval of a specific GEO DataSet. The data
59    is provided as Orange's ExampleTable.</p>
60
61<p class=section>Methods</p>
62<dl class=attributes>
63<dt>GDS(gdsname, verbose=False, force_download=False)</dt>
64<dd><p>Constructor returning an object used to retrieve a GEO
65  DataSet table (samples and gene expressions). <code>gdsname</code>
66  is NCBI's ID for the data set in the form "GDSn", where "n" is the
67  GDS ID number. The constructor checks the local cache directory for
68  the data file and, if it is not there, downloads it from
69  <a href="ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/">NCBI's GEO
70  FTP site</a>. The download is forced
71  if <code>force_download=True</code>. The compressed data file
72  resides in the cache directory after the constructor is called
73  (a call to <code>orngServerFiles.localpath("GEO")</code> reveals the
74  path of this directory).</p>
75
76<xmp class=code>>>> import obiGEO
77>>> gds = obiGEO.GDS("GDS1676")
78>>> print ", ".join(gds.genes[:10])
79EXO1, BUB1B, LTB4R2, FOXA1, MEN1, LIFR, L1CAM, TRAF3, AKAP1, PIK3CD
80>>> gds.info["title"]
81'T cell leukemia cell response to human herpesvirus 6 infection: time course'
82>>> print gds
83GDS1676 (Homo sapiens), samples=8, features=2100, genes=667, subsets=8
84</xmp>
85</dd>
86
87<dt>getdata(report_genes=True, transpose=False,
88merge_function=variableMean, sample_type=None,
89remove_unknown=None)</dt>
90<dd><p>This method returns the data from the GEO DataSet in
91  Orange format. Microarray spots reported in the GEO data set can
92  either be merged according to their gene IDs
93  (<code>report_genes=True</code>) or be left as spots. The data
94  matrix can have spots/genes in rows and samples in columns
95  (default, <code>transpose=False</code>) or samples in rows and
96  spots/genes in columns
97  (<code>transpose=True</code>). Argument <code>sample_type</code>
98  defines the type of annotation, or (if <code>transpose=True</code>)
99  the type of class labels to be included in the data set. Namely,
100  with <code>sample_type</code>, the annotation of samples is
101  included either in the class value or in
102  the <code>.attributes</code> field of each attribute of the
103  data set. Spots with sample profiles that include unknown values
104  are retained by default (<code>remove_unknown=None</code>). They are
105  removed if the proportion of samples with unknown values
106  is above the threshold set by <code>remove_unknown</code>.</p>
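<p>The effect of <code>remove_unknown</code> can be pictured with a small stand-alone sketch. It does not use obiGEO: <code>drop_sparse_rows</code> is a hypothetical helper, unknown values are represented by <code>None</code>, and the choice to keep a row whose proportion exactly equals the threshold is an assumption. Only the EXO1 profile comes from this document; the other rows are made up.</p>

```python
def drop_sparse_rows(rows, remove_unknown=None):
    """Drop rows whose proportion of unknown (None) values exceeds the threshold.

    With remove_unknown=None every row is retained.
    """
    if remove_unknown is None:
        return list(rows)
    kept = []
    for name, values in rows:
        unknown = sum(1 for v in values if v is None)
        # Assumption: a row is removed only if the proportion is strictly above
        # the threshold.
        if float(unknown) / len(values) <= remove_unknown:
            kept.append((name, values))
    return kept


rows = [
    # 2 of 8 values unknown (profile shown in the getdata example)
    ("EXO1", [None, None, -0.803, 0.128, 0.110, -2.0, -1.0, -0.358]),
    # made-up row, 5 of 8 values unknown
    ("spot_a", [None, None, None, None, None, 0.5, 0.1, 0.2]),
    # made-up row, no unknown values
    ("spot_b", [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]),
]

kept_default = [name for name, _ in drop_sparse_rows(rows)]
kept_half = [name for name, _ in drop_sparse_rows(rows, remove_unknown=0.5)]
kept_strict = [name for name, _ in drop_sparse_rows(rows, remove_unknown=0.1)]
print(kept_default)
print(kept_half)
print(kept_strict)
```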
107
108<p>The following illustrates how <code>getdata</code> is used to
109  construct a data set with genes in rows and samples in
110  columns. Note that the annotation of each sample is retained
111  in <code>.attributes</code>.</p>
112
113<xmp class=code>>>> import obiGEO
114>>> gds = obiGEO.GDS("GDS1676")
115>>> data = gds.getdata()
116>>> len(data)
117667
118>>> data[0]
119[?, ?, -0.803, 0.128, 0.110, -2.000, -1.000, -0.358], {"gene":'EXO1'}
120>>> data.domain.attributes[0]
121FloatVariable 'GSM63816'
122>>> data.domain.attributes[0].attributes
123{'dose': '20 U/ml IL-2', 'infection': 'acute ', 'time': '1 d'}
124</xmp>
125
126</dd>
127</dl>
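<p>Merging spots by gene (<code>report_genes=True</code>) can be pictured as follows. This is only a sketch under assumptions: the name <code>variableMean</code> suggests that the profiles of spots sharing a gene are averaged sample by sample, but its exact behavior, in particular the handling of unknown values, is not described here. <code>merge_spots_by_gene</code> is a hypothetical helper and the numbers are made up.</p>

```python
from collections import defaultdict


def merge_spots_by_gene(spots):
    """Average the profiles of spots that share a gene name, sample by sample.

    Unknown values (None) are skipped; a sample with no known value stays None.
    """
    by_gene = defaultdict(list)
    for gene, profile in spots:
        by_gene[gene].append(profile)

    merged = {}
    for gene, profiles in by_gene.items():
        averaged = []
        # zip(*profiles) walks over samples (columns) across all spots of a gene
        for column in zip(*profiles):
            known = [v for v in column if v is not None]
            averaged.append(sum(known) / float(len(known)) if known else None)
        merged[gene] = averaged
    return merged


# Made-up spots: two for EXO1, one for BUB1B, three samples each
spots = [
    ("EXO1", [1.0, None, 3.0]),
    ("EXO1", [3.0, None, None]),
    ("BUB1B", [0.5, 0.25, 1.0]),
]
merged = merge_spots_by_gene(spots)
print(merged["EXO1"])
print(merged["BUB1B"])
```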
128
129<h2>Examples</h2>
130
131<p>The following script prints out some information about a specific data set. It does not download the data set, just uses the (local) GEO data sets information file.</p>
132
133<p class="header"><a href="geo_gds1.py">geo_gds1.py</a></p>
134<xmp class=code>import obiGEO
135import textwrap
136
137gdsinfo = obiGEO.GDSInfo()
138gds = gdsinfo["GDS10"]
139
140print "ID:", gds["dataset_id"]
141print "Features:", gds["feature_count"]
142print "Genes:", gds["gene_count"]
143print "Organism:", gds["platform_organism"]
144print "PubMed ID:", gds["pubmed_id"]
145print "Sample types:"
146for sampletype in set([sinfo["type"] for sinfo in gds["subsets"]]):
147    ss = [sinfo["description"] for sinfo in gds["subsets"] if sinfo["type"]==sampletype]
148    print "  %s (%s)" % (sampletype, ", ".join(ss))
149print
150print "Description:"
151print "\n".join(textwrap.wrap(gds["description"], 70))
152</xmp>
153
154<p>The output of this script is:</p>
155
156<xmp class=code>ID: GDS10
157Features: 39114
158Genes: 20094
159Organism: Mus musculus
160PubMed ID: 11827943
161Sample types:
162  disease state (diabetic, diabetic-resistant, nondiabetic)
163  strain (NOD, Idd3, Idd5, Idd3+Idd5, Idd9, B10.H2g7, B10.H2g7 Idd3)
164  tissue (spleen, thymus)
165
166Description:
167Examination of spleen and thymus of type 1 diabetes nonobese diabetic
168(NOD) mouse, four NOD-derived diabetes-resistant congenic strains and
169two nondiabetic control strains.
170</xmp>
171
172<p>GEO data sets provide a sort of mini ontology for sample labeling. Samples belong to sample subsets, which in turn belong to specific types. GDS10 above, for example, has three sample types, and the subsets of its tissue type are spleen and thymus. If you want to use data sets for supervised data mining, it is useful to find out which data sets provide enough samples for each label. It is (semantically) convenient to perform classification within sample subsets of the same type. We therefore need a script that goes through the entire collection of data sets and finds those in which, for a specific type, there are enough samples within each of the subsets. The following script does the work. The function <code>valid</code> is passed the information about a data set and determines which subset types (if any) satisfy the "validity" criterion. The required number of samples in each subset is by default set to <code>n=40</code>.</p>
173
174<p class="header"><a href="geo_gds5.py">geo_gds5.py</a></p>
175<xmp class=code>import obiGEO
176
177def valid(info, n=40):
178    """Return a set of subset types containing at least n samples in every subset"""
179    invalid = set()
180    subsets = set([sinfo["type"] for sinfo in info["subsets"]])
181    for sampleinfo in info["subsets"]:
182        if len(sampleinfo["sample_id"]) < n:
183            invalid.add(sampleinfo["type"])
184    return subsets.difference(invalid)
185
186def report(stypes, info):
187    """Pretty-print GDS and valid subset types"""
188    for id, sts in stypes:
189        print id
190        for st in sts:
191            print "  %s:" % st,
192            gds = info[id]
193            print ", ".join(["%s/%d" % (sinfo["description"], len(sinfo["sample_id"])) \
194                             for sinfo in gds["subsets"] if sinfo["type"]==st])
195
196gdsinfo = obiGEO.GDSInfo()
197valid_subset_types = [(id, valid(info)) for id, info in gdsinfo.items() if valid(info)]
198report(valid_subset_types, gdsinfo)
199</xmp>
200
201<p>The required number of samples, <code>n=40</code>, seems to be quite a stringent criterion, met (at the time of writing of this documentation) by only a few data sets (you may try lowering this threshold):</p>
202
203<xmp class="code">GDS1611
204  genotype/variation: wild type/48, upf1 null mutant/48
205GDS968
206  agent: none/57, UV/57, IR/57
207GDS1490
208  other: non-neural/50, neural/100
209GDS2373
210  gender: male/82, female/48
211GDS1293
212  tissue: raphe magnus/40, somatomotor cortex/41
213GDS2960
214  disease state: control/41, Marfan syndrome/60
215GDS1292
216  tissue: raphe magnus/40, somatomotor cortex/43
217GDS1412
218  protocol: no treatment/47, hormone replacement therapy/42
219</xmp>
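<p>To see how the threshold changes the outcome without downloading anything, <code>valid</code> can be applied to a hand-made description. In the sketch below <code>toy_info</code> and its sample IDs are invented; only the <code>subsets</code>, <code>type</code>, <code>description</code> and <code>sample_id</code> keys follow the structure used by the script above.</p>

```python
def valid(info, n=40):
    """Return the set of subset types with at least n samples in every subset."""
    invalid = set()
    subsets = set(sinfo["type"] for sinfo in info["subsets"])
    for sampleinfo in info["subsets"]:
        if len(sampleinfo["sample_id"]) < n:
            invalid.add(sampleinfo["type"])
    return subsets.difference(invalid)


# Invented data set description: two subset types, two subsets each
toy_info = {"subsets": [
    {"type": "tissue", "description": "spleen", "sample_id": ["s1", "s2", "s3"]},
    {"type": "tissue", "description": "thymus", "sample_id": ["s4", "s5"]},
    {"type": "strain", "description": "NOD", "sample_id": ["s1", "s4", "s5"]},
    {"type": "strain", "description": "Idd3", "sample_id": ["s2"]},
]}

for threshold in (1, 2, 3):
    # Lower thresholds admit more subset types
    print("n=%d: %s" % (threshold, sorted(valid(toy_info, n=threshold))))
```

<p>A type is rejected as soon as one of its subsets falls below the threshold, which is why "strain" drops out at <code>n=2</code> even though its NOD subset has three samples.</p>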
220
221<p>Let us now pick one data set from the above (GDS2960) and see if we can predict the disease state. We will use LinearLearner, a fast variant of support vector machines with a linear kernel, and measure AUC, the area under the ROC curve, in 10-fold cross-validation. AUC is the probability of correctly distinguishing between the two classes when picking one sample from the target class (e.g., the disease) and one from the non-target class (e.g., control).</p>
222
223<p class="header"><a href="geo_gds6.py">geo_gds6.py</a></p>
224<xmp class="code">import obiGEO
225import orange
226import orngTest
227import orngStat
228
229gds = obiGEO.GDS("GDS2960")
230data = gds.getdata(sample_type="disease state", transpose=True)
231print "Samples: %d, Genes: %d" % (len(data), len(data.domain.attributes))
232
233learners = [orange.LinearLearner]
234results = orngTest.crossValidation(learners, data, folds=10)
235print "AUC = %.3f" % orngStat.AUC(results)[0]
236</xmp>
237
238<p>The output of this script is:</p>
239
240<xmp class="code">Samples: 101, Genes: 3979
241AUC = 0.985</xmp>
242
243<p>The AUC for this data set is very high, indicating that, with this particular gene expression data, separating the two classes is almost trivial.</p>
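<p>The pairwise reading of AUC given above can be made concrete with a few lines of plain Python. This is a sketch of the textbook definition, not of how <code>orngStat.AUC</code> is implemented; <code>pairwise_auc</code> is a hypothetical helper, the scores are invented, and counting ties as one half is a common convention assumed here.</p>

```python
def pairwise_auc(target_scores, other_scores):
    """Fraction of (target, non-target) pairs ranked correctly.

    A pair is ranked correctly when the target sample has the higher score;
    ties count as half a correct pair.
    """
    correct = 0.0
    for t in target_scores:
        for o in other_scores:
            if t > o:
                correct += 1.0
            elif t == o:
                correct += 0.5
    return correct / (len(target_scores) * len(other_scores))


# Invented classifier scores (higher means "more likely disease")
disease = [0.9, 0.8, 0.4]
control = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(disease, control)
print("AUC = %.3f" % auc)
```

<p>Of the 12 possible pairs, only the pair (0.4, 0.7) is ranked incorrectly, so the sketch yields 11/12. An AUC of 1.0 would mean that every disease sample scores above every control sample.</p>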
244
245
246</body>
247</html>