source: orange-bioinformatics/doc/modules/obiGsea.htm @ 937:430db20ac9b0

Revision 937:430db20ac9b0, 10.9 KB checked in by markotoplak, 5 years ago

obiGsea: documented using datasets with samples as attributes

<html>

<head>
<title>obiGsea: Gene Set Enrichment Analysis</title>
<link rel=stylesheet href="../style.css" type="text/css">
<link rel=stylesheet href="style-print.css" type="text/css" media=print>
</head>

<body>
<h1>obiGsea: Gene Set Enrichment Analysis</h1>
<index name="modules/gene set enrichment analysis">

<p>Gene Set Enrichment Analysis (GSEA) is a method that tries to identify groups of genes that are
regulated together. It is implemented in the module obiGsea, which is part of the Orange for Functional Genomics package,
so that package must be installed before obiGsea can be used.</p>

<p>GSEA takes gene
expression data for multiple samples together with their phenotypes and computes
the enrichment of the given gene sets. To run it, call the
<code>runGSEA</code> function with the following arguments:</p>

<h2>runGSEA</h2>
<index name="gsea">
<index name="GSEA">
<index name="gene+set">

<p class=section>Arguments</p>
<dl class=arguments>

  <dt>data</dt>
  <dd>An <A href="ExampleTable.htm"><CODE>ExampleTable</CODE></A> with gene expression data.</dd>

  <dt>matcher</dt>
  <dd>An initialized gene matcher.</dd>

  <dt>classValues</dt>
  <dd>A pair of class values naming the two phenotypes between which gene correlations
  are computed. Only examples with one of the chosen class values are considered in the analysis. If not specified, all values of the <code>classVar</code> attribute descriptor are used.</dd>

  <dt>geneSets</dt>
  <dd>A Python dictionary of gene sets, where each key is a gene set name and its value is a list of gene aliases for the genes
  in that gene set. Default: gene sets in your collection directory.</dd>

  <dt>n</dt>
  <dd>GSEA computes gene set significance by permutation tests. This parameter specifies the number
  of permutations. Default: 100.</dd>

  <dt>permutation</dt>
  <dd>Type of permutation. If <code>"class"</code> (the default), class values (phenotypes) are permuted.
  If the number of samples is small (fewer than 10), it is advisable to use <code>"gene"</code> permutations instead, even
  though they ignore gene-gene interactions.</dd>

  <dt>minSize, maxSize</dt>
  <dd>Minimum and maximum number of genes from a gene set that must also be present in the data set for that gene set to be analyzed.
  Defaults: 3 and 1000.</dd>

  <dt>minPart</dt>
  <dd>Minimum fraction of genes from a gene set that must also be present in the data set for that gene set to be analyzed. Default: 0.1.</dd>

  <dt>atLeast</dt>
  <dd>Number of valid attribute values required for each class value for an attribute (gene) to be considered in the analysis. Attributes that fail to meet this criterion are ignored. Default: 3.</dd>

  <dt>phenVar</dt>
  <dd>Specifies how phenotypes are marked. By default, <CODE>data.domain.classVar</CODE> is used to distinguish them, if it exists. If the argument is set to <code>False</code>, GSEA presumes that the genes are already ranked. If the value of <code>phenVar</code> is a string, the phenotype of each attribute is read from the entry with that name in the attribute's <code>attributes</code> dictionary. In the latter case, each attribute represents one sample.</dd>

  <dt>geneVar</dt>
  <dd>Specifies where gene names are stored. If <code>True</code>, gene names are attribute names. If it is a string, gene names are taken from the corresponding entry in the <code>attributes</code> dictionary of each attribute. If each attribute represents a sample, the user should pass the meta variable containing the gene names. Defaults to attribute names if each example represents one sample.</dd>

</dl>
<!-- es, nes, pval, fdr, os, ts, genes -->

<p><code>runGSEA</code> returns a dictionary in which each key is a gene set label and each value is a dictionary
with the following entries:</p>
<ul>
<li> enrichment score (key <code>es</code>),
<li> normalized enrichment score (key <code>nes</code>),
<li> P-value (key <code>p</code>),
<li> FDR (key <code>fdr</code>),
<li> whole gene set size (key <code>size</code>),
<li> number of matched genes from the gene set (key <code>matched_size</code>),
<li> gene names from the data set for matched genes from the gene set (key <code>genes</code>).
</ul>
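
<p>As a sketch of how such a result dictionary might be processed, the snippet below selects gene sets by FDR. The dictionary <code>res</code> is made up for illustration and only mimics the structure listed above. The helper <code>significant_sets</code> and its threshold are hypothetical and not part of obiGsea.</p>

```python
# Made-up results in the shape described above; the numbers are placeholders,
# not the output of a real GSEA run.
res = {
    "set A": {"es": 0.61, "nes": 1.80, "p": 0.010, "fdr": 0.040,
              "size": 36, "matched_size": 35, "genes": ["g1", "g2"]},
    "set B": {"es": -0.20, "nes": -0.70, "p": 0.750, "fdr": 0.300,
              "size": 12, "matched_size": 4, "genes": ["g3"]},
    "set C": {"es": 0.72, "nes": 2.00, "p": 0.000, "fdr": 0.000,
              "size": 94, "matched_size": 70, "genes": ["g4", "g5"]},
}


def significant_sets(res, max_fdr=0.05):
    """Return (label, result) pairs with FDR at or below max_fdr, lowest FDR first."""
    hits = [(name, r) for name, r in res.items() if r["fdr"] <= max_fdr]
    return sorted(hits, key=lambda item: item[1]["fdr"])


for name, r in significant_sets(res):
    print("%-8s NES=%6.3f FDR=%5.3f matched %d of %d"
          % (name, r["nes"], r["fdr"], r["matched_size"], r["size"]))
```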
84
85<p>A note on gene name matching. Gene name matching is performed with the help of KEGG database.
86A gene from a gene set is tried to be matched with a gene from the data set. If an alias for a gene from the
87gene set is the same as an alias for a gene in the data set, then those aliases are matched. If not,
88it is checked if gene alias from the gene set and gene alias from the data set are both gene
89aliases of the same gene according to KEGG database for a given organism. If they are, we have a match.</p>
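
<p>The two-step matching described above can be sketched in plain Python. This is only an illustration under stated assumptions: the alias table is a hand-written toy list rather than the KEGG database, and <code>build_alias_index</code> and <code>match_gene</code> are hypothetical helpers, not functions of obiGene.</p>

```python
def build_alias_index(alias_groups):
    """Map every alias to the index of the alias group (gene) it belongs to."""
    index = {}
    for gene_id, aliases in enumerate(alias_groups):
        for alias in aliases:
            index[alias] = gene_id
    return index


def match_gene(set_alias, data_aliases, alias_index):
    """Return the data set alias that matches set_alias, or None."""
    # Step 1: the alias itself appears in the data set.
    if set_alias in data_aliases:
        return set_alias
    # Step 2: some data set alias refers to the same gene.
    gene_id = alias_index.get(set_alias)
    if gene_id is None:
        return None
    for alias in data_aliases:
        if alias_index.get(alias) == gene_id:
            return alias
    return None


# Toy alias table standing in for a real gene database.
alias_index = build_alias_index([
    ["TP53", "p53", "7157"],
    ["BRCA1", "672"],
])
data_aliases = ["p53", "BRCA1"]

print(match_gene("BRCA1", data_aliases, alias_index))  # direct match
print(match_gene("TP53", data_aliases, alias_index))   # match through a shared gene
print(match_gene("MYC", data_aliases, alias_index))    # unknown alias, no match
```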

<h3>Example 1</h3>

<p>We start with a simple usage example. The data used here are not gene expression
data, so for the method to work we had to define our own sets of attributes that seem to "belong together".</p>

<p class="header"><a href="gsea1.py">gsea1.py</a> (uses <a href="http://www.ailab.si/orange/doc/datasets/iris.tab">iris.tab</a>)</p>

<xmp class=code>import orange, obiGsea, obiGene

data = orange.ExampleTable("iris")

# Hand-made "gene sets": attribute names that seem to belong together.
# "petal color" is not an attribute of iris and will stay unmatched.
gen1 = dict([
    ("sepal",["sepal length", "sepal width"]),
    ("petal",["petal length", "petal width", "petal color"])
    ])

# minSize=2: each set matches only two attributes (the default minSize is 3).
res = obiGsea.runGSEA(data, matcher=obiGene.matcher([]), minSize=2, geneSets=gen1)
print "%5s  %6s %6s %s" % ("LABEL", "NES", "P-VAL", "GENES")
for name,resu in res.items():
    print "%5s  %6.3f %6.3f %s" % (name, resu["nes"], resu["p"], str(resu["genes"]))
</xmp>

<p>Corresponding output:</p>

<xmp class=code>LABEL     NES  P-VAL GENES
petal  -1.117  0.771 ['petal length', 'petal width']
sepal   1.087  0.630 ['sepal length', 'sepal width']
</xmp>

<p>The "gene" labeled "petal color" was not used because it could not be matched to any attribute in the data set.</p>

<h3>Example 2: using correlation data directly</h3>

<p>
GSEA can also use correlations between individual genes and a phenotype directly. Two kinds of input data trigger this behavior: (1) data with only one example, where attribute names are gene names, or (2) data whose domain contains only one continuous attribute. In the latter case, gene correlations are read from the continuous attribute and gene names are taken from the first <code>StringVariable</code> in the domain. We have chosen the latter representation for this example, since this is the format returned by the <code>obiDicty</code> module. Only results for ten pathways are listed.
</p>

<p class="header"><a href="gsea2.py">gsea2.py</a></p>

<xmp class=code>import obiDicty
import obiGeneSets
import obiGsea
import orange
import obiGene

dbc = obiDicty.DatabaseConnection()
data = dbc.getData(sample='pkaC-', time="8")[0] #get first chip

# Each example holds one continuous value and the gene name as a meta value.
print "First 10 examples"
for ex in data[:10]:
    print ex

matcher=obiGene.matcher([[obiGene.GMKEGG("ddi"),obiGene.GMDicty()]])

genesets =  obiGeneSets.collections([":kegg:ddi"])
# Gene permutations: the data contain a single value per gene.
res = obiGsea.runGSEA(data, matcher=matcher, minPart=0.05, geneSets=genesets,
    permutation="gene")

print "GSEA results"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "P-VAL", "SIZE", "MATCHED")
for name,resu in res.items()[:10]:
    print "%-40s %6.3f %6.3f %6d %7d" % (name[:30], resu["nes"], resu["p"],
        resu["size"], resu["matched_size"])
</xmp>

<p>Corresponding output:</p>

<xmp class=code>First 10 examples
[-0.055], {"DDB":'#*'}
[0.003], {"DDB":'#17S_ribosomal_gene'}
[-0.168], {"DDB":'#B-elongation_factor'}
[-0.134], {"DDB":'DDB_G0272032'}
[0.330], {"DDB":'DDB_G0291982'}
[0.229], {"DDB":'DDB_G0268264'}
[0.108], {"DDB":'DDB_G0293286'}
[-0.130], {"DDB":'DDB_G0284295'}
[0.189], {"DDB":'DDB_G0285435'}
[0.082], {"DDB":'DDB_G0287595'}
GSEA results
LABEL                                       NES  P-VAL   SIZE MATCHED
[KEGG] Alanine and aspartate m            1.367  0.146     21      12
[KEGG] Cysteine metabolism                1.328  0.098      8       3
[KEGG] Folate biosynthesis                0.703  0.862     12       4
[KEGG] 3-Chloroacrylic acid de           -0.734  0.750      4       3
[KEGG] Porphyrin and chlorophy           -0.990  0.517     14       5
[KEGG] Drug metabolism - other            0.951  0.545     15      12
[KEGG] Sulfur metabolism                 -0.538  0.980      6       3
[KEGG] N-Glycan biosynthesis              1.384  0.111     22       6
[KEGG] Terpenoid biosynthesis            -1.316  0.125      7       3
[KEGG] Aminoacyl-tRNA biosynth           -1.105  0.292     42      12
</xmp>
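
<p>To give an idea of what can be computed from per-gene correlations like those used in this example, the following sketch implements the weighted running-sum enrichment score described by Subramanian et al. (see References). It is a stand-alone illustration with made-up correlations. Whether obiGsea computes the score in exactly this way is an assumption, and <code>enrichment_score</code> is a hypothetical helper, not part of the module.</p>

```python
def enrichment_score(correlations, gene_set, p=1.0):
    """Running-sum enrichment score of one gene set.

    correlations: dict mapping gene name -> correlation with the phenotype.
    gene_set: iterable of gene names.
    p: exponent that weights hits by |correlation| ** p.
    """
    gene_set = set(gene_set)
    # Rank genes from the most positive to the most negative correlation.
    ranked = sorted(correlations, key=lambda g: correlations[g], reverse=True)
    hits = [g for g in ranked if g in gene_set]
    n_miss = len(ranked) - len(hits)
    hit_weight = sum(abs(correlations[g]) ** p for g in hits)
    if not hits or n_miss == 0 or hit_weight == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g in ranked:
        if g in gene_set:
            running += abs(correlations[g]) ** p / hit_weight  # step up on a hit
        else:
            running -= 1.0 / n_miss                            # step down on a miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Made-up correlations for six hypothetical genes.
correlations = {"g1": 0.9, "g2": 0.7, "g3": 0.1, "g4": -0.2, "g5": -0.6, "g6": -0.8}
es_top = enrichment_score(correlations, ["g1", "g2"])
es_bottom = enrichment_score(correlations, ["g5", "g6"])
print("top set: %.3f, bottom set: %.3f" % (es_top, es_bottom))
```

<p>A set concentrated at the top of the ranking gets a score close to 1, and a set concentrated at the bottom gets a score close to -1.</p>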

<h3>Example 3: attributes as samples (from Gene Expression Omnibus)</h3>

<p>
Data sets in which attributes represent samples can also be used, provided that the <code>phenVar</code> and <code>geneVar</code> parameters are specified.
Only results for the ten pathways with the lowest FDR are listed. To obtain valid results, we would have to increase the number of permutations (parameter <code>n</code>).
</p>

<p class="header"><a href="gsea3.py">gsea3.py</a></p>

<xmp class=code>import obiGeneSets
import obiGsea
import orange
import obiGene
import obiGEO

gds = obiGEO.GDS("GDS10")
data = gds.getdata()

# List the descriptors that can be used as phenVar.
print "Possible phenotype descriptors:"
print map(lambda x: x[0], obiGsea.allgroups(data).items())

matcher=obiGene.matcher([obiGene.GMKEGG("9606")])

phenVar = "tissue"
geneVar = "gene" #use gene meta variable for gene names

genesets =  obiGeneSets.collections([":kegg:hsa"])
# n=10 keeps the example fast; use more permutations for valid results.
res = obiGsea.runGSEA(data, matcher=matcher, minPart=0.05, geneSets=genesets,
    permutation="class", n=10, phenVar=phenVar, geneVar=geneVar)

print
print "GSEA results (choosen descriptor: tissue)"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "FDR", "SIZE", "MATCHED")
# Show the ten gene sets with the lowest FDR.
for name,resu in sorted(res.items(), key=lambda x: x[1]["fdr"])[:10]:
    print "%-40s %6.3f %6.3f %6d %7d" % (name[:30], resu["nes"], resu["fdr"],
        resu["size"], resu["matched_size"])
</xmp>

<p>Corresponding output:</p>

<xmp class=code>Possible phenotype descriptors:
['disease state', 'strain', 'tissue']

GSEA results (choosen descriptor: tissue)
LABEL                                       NES    FDR   SIZE MATCHED
[KEGG] DNA replication                    1.783  0.000     36      35
[KEGG] Cell cycle                         1.832  0.000    119      96
[KEGG] T cell receptor signali            2.006  0.000     94      70
[KEGG] Non-homologous end-join            1.795  0.000     14      13
[KEGG] Glycine, serine and thr           -1.828  0.022     42      33
[KEGG] Porphyrin and chlorophy           -1.829  0.028     41      24
[KEGG] Thyroid cancer                     1.666  0.036     29      24
[KEGG] Adipocytokine signaling           -1.880  0.037     67      53
[KEGG] Ubiquitin mediated prot            1.597  0.051    137     112
[KEGG] Glycosaminoglycan degra           -1.899  0.055     18      16
</xmp>
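
<p>Why a small number of permutations limits the validity of the results can be seen from a generic class-permutation test. The sketch below is not obiGsea code. It uses a simple difference of class means as the test statistic, made-up values, and hypothetical helper names. An empirical p-value obtained from <code>n</code> permutations is always a multiple of 1/<code>n</code>, so with <code>n=10</code> no nonzero p-value below 0.1 can be reported.</p>

```python
import random


def mean_difference(values, labels, first, second):
    """Difference between the mean of class `first` and the mean of class `second`."""
    a = [v for v, lab in zip(values, labels) if lab == first]
    b = [v for v, lab in zip(values, labels) if lab == second]
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_p_value(values, labels, first, second, n=100, seed=0):
    """One-sided empirical p-value from n random permutations of the class labels."""
    rng = random.Random(seed)  # fixed seed so that the sketch is repeatable
    observed = mean_difference(values, labels, first, second)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n):
        rng.shuffle(shuffled)  # permute phenotypes, keep the measured values fixed
        if mean_difference(values, shuffled, first, second) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / float(n)


# Made-up measurements of one gene in eight samples from two phenotypes.
values = [5.1, 4.9, 5.3, 5.0, 2.1, 1.9, 2.2, 2.0]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

p10 = permutation_p_value(values, labels, "a", "b", n=10)
p1000 = permutation_p_value(values, labels, "a", "b", n=1000)
print("n=10: p=%.3f   n=1000: p=%.3f" % (p10, p1000))
```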



<HR>
<H2>References</H2>

<P>Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 2005.</P>

</body>
</html>