Changeset 1938:7e1dc12fd41e in orange-bioinformatics


Timestamp:
12/13/13 15:31:46
Author:
markotoplak
Branch:
default
Message:

Converted obiGEO documentation to rst.

Files:
6 added
2 edited

  • docs/rst/reference/geo.rst

    r1937 r1938  
    77 
    88************************************************************** 
    9 An interface to NCBI's Gene Expression Omnibus (:mod:`obiGEO`) 
     9NCBI's Gene Expression Omnibus interface (:mod:`obiGEO`) 
    1010************************************************************** 
    1111 
     
    3535    Out[191]: {'dose': '20 U/ml IL-2', 'infection': 'acute ', 'time': '1 d'} 
    3636 
    37 GDS 
    38 === 
     37GDS classes 
     38=========== 
    3939 
    4040.. autoclass:: GDSInfo 
    4141   :members: 
    4242 
    43 :: 
     43An example that uses :obj:`GDSInfo`:: 
    4444 
    45     >>> import obiGEO 
     45    >>> from Orange import obiGEO 
    4646    >>> info = obiGEO.GDSInfo() 
    4747    >>> info.keys()[:5] 
     
    5757 
    5858Examples 
    59 ================== 
     59======== 
    6060 
    61 Genes can have multiple aliases. When we combine data from different 
    62 sources, for example expression data with GO gene sets, we have to 
    63 match gene aliases representing the same genes. All implemented matching 
    64 methods are based on sets of gene aliases for one gene. 
     61The following script prints out some information about a specific data 
     62set. It does not download the data set, just uses the (local) GEO data 
     63sets information file (:download:`geo_gds1.py <code/geo_gds1.py>`). 
    6564 
    66 .. autoclass:: Matcher 
    67    :members: 
     65.. literalinclude:: code/geo_gds1.py 
     66   :lines: 6- 
    6867 
    69 This modules provides the following gene matchers: 
     68The output of this script is:: 
    7069 
    71 .. autoclass:: MatcherAliasesKEGG 
     70    ID: GDS10 
     71    Features: 39114 
     72    Genes: 19883 
     73    Organism: Mus musculus 
     74    PubMed ID: 11827943 
     75    Sample types: 
     76      disease state (diabetic, diabetic-resistant, nondiabetic) 
     77      strain (NOD, Idd3, Idd5, Idd3+Idd5, Idd9, B10.H2g7, B10.H2g7 Idd3) 
     78      tissue (spleen, thymus) 
    7279 
    73 .. autoclass:: MatcherAliasesGO 
    74  
    75 .. autoclass:: MatcherAliasesDictyBase 
    76  
    77 .. autoclass:: MatcherAliasesNCBI 
    78  
    79 .. autoclass:: MatcherAliasesEnsembl 
    80  
    81 .. autoclass:: MatcherDirect 
    82  
    83 Gene name matchers can applied in sequence (until the first match) or combined (overlapping sets of gene aliases of multiple gene matchers are combined) with the :obj:`matcher` function. 
    84  
    85 .. autofunction:: matcher 
    86  
    87 The following example tries to match input genes onto KEGG gene aliases (:download:`genematch2.py <code/genematch2.py>`). 
    88  
    89 .. literalinclude:: code/genematch2.py 
    90  
    91 Results show that GO aliases can not match onto KEGG gene IDs. For the last gene only joined GO and KEGG aliases produce a match:: 
    92  
    93         gene         KEGG           GO      KEGG+GO 
    94         cct7    hsa:10574         None    hsa:10574 
    95         pls1     hsa:5357         None     hsa:5357 
    96         gdi1     hsa:2664         None     hsa:2664 
    97        nfkb2     hsa:4791         None     hsa:4791 
    98       a2a299         None         None     hsa:7052 
     80    Description: 
     81    Examination of spleen and thymus of type 1 diabetes nonobese diabetic 
     82    (NOD) mouse, four NOD-derived diabetes-resistant congenic strains and 
     83    two nondiabetic control strains. 
    9984 
    10085 
    101 The following example finds KEGG pathways with given genes (:download:`genematch_path.py <code/genematch_path.py>`). 
     86GEO data sets provide a sort of mini ontology for sample labeling. Samples 
     87belong to sample subsets, which in turn belong to specific types. For 
     88example, GDS10 above has three sample types, and the subsets of 
     89the tissue type are spleen and thymus. For supervised data mining it 
     90would be useful to find out which data sets provide enough samples for 
     91each label. It is (semantically) convenient to perform classification 
     92within sample subsets of the same type. We therefore need a script 
     93that goes through the entire set of data sets and finds those where 
     94there are enough samples within each of the subsets for a specific 
     95type. The following script does the work. The function ``valid`` 
     96determines which subset types (if any) satisfy our criteria. The 
     97number of requested samples in the subset is by default set to ``n=40`` 
     98(:download:`geo_gds5.py <code/geo_gds5.py>`). 
    10299 
    103 .. literalinclude:: code/genematch_path.py 
     100.. literalinclude:: code/geo_gds5.py 
     101   :lines: 8- 
    104102 
    105 Output:: 
     103The requested number of samples, ``n=40``, seems to be quite 
     104a stringent criterion, met (at the time of writing of this documentation) 
     105by 35 sample subsets. The output starts with:: 
    106106 
    107     Fndc4 is in 
    108       / 
    109     Itgb8 is in 
    110       PI3K-Akt signaling pathway 
    111       Focal adhesion 
    112       ECM-receptor interaction 
    113       Cell adhesion molecules (CAMs) 
    114       Regulation of actin cytoskeleton 
    115       Hypertrophic cardiomyopathy (HCM) 
    116       Arrhythmogenic right ventricular cardiomyopathy (ARVC) 
    117       Dilated cardiomyopathy 
    118     Cdc34 is in 
    119       Ubiquitin mediated proteolysis 
    120       Herpes simplex infection 
    121     Olfr1403 is in 
    122       Olfactory transduction 
     107    GDS1611 
     108      genotype/variation: wild type/48, upf1 null mutant/48 
     109    GDS3553 
     110      cell type: macrophage/48, monocyte/48 
     111    GDS3953 
     112      protocol: training set/46, validation set/47 
     113    GDS3704 
     114      protocol: PUFA consumption/42, SFA consumption/42 
     115    GDS3890 
     116      agent: vehicle, control/46, TCE/48 
     117    GDS1490 
     118      other: non-neural/50, neural/100 
     119    GDS3622 
     120      genotype/variation: wild type/56, Nrf2 null/54 
     121    GDS3715 
     122      agent: untreated/55, insulin/55 
    123123 
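The subset filtering described above can be sketched without downloading anything. This is a minimal illustration, not the contents of ``geo_gds5.py``: the dictionary layout (``subsets``, ``type``, ``description``, ``sample_id``), the toy data set names and the "at least ``n``" reading of the threshold are all assumptions made for the example.

```python
def valid(info, n=40):
    """Return the subset types in which every subset has at least n samples.

    `info` is assumed to be a dict with a "subsets" list; each subset is
    assumed to carry a "type" and a list of sample ids under "sample_id".
    """
    subsets = info["subsets"]
    all_types = set(s["type"] for s in subsets)
    # A type is rejected as soon as one of its subsets is too small.
    too_small = set(s["type"] for s in subsets if len(s["sample_id"]) < n)
    return all_types - too_small


def sample_ids(count):
    """Build a list of made-up sample identifiers (hypothetical helper)."""
    return ["GSM%d" % i for i in range(count)]


# Hypothetical stand-in for the GEO data sets information file.
toy_info = {
    "GDS_toy1": {"subsets": [
        {"type": "tissue", "description": "spleen", "sample_id": sample_ids(45)},
        {"type": "tissue", "description": "thymus", "sample_id": sample_ids(41)},
        {"type": "strain", "description": "NOD", "sample_id": sample_ids(10)},
        {"type": "strain", "description": "Idd3", "sample_id": sample_ids(76)},
    ]},
    "GDS_toy2": {"subsets": [
        {"type": "agent", "description": "untreated", "sample_id": sample_ids(12)},
        {"type": "agent", "description": "insulin", "sample_id": sample_ids(12)},
    ]},
}

# Keep only the data sets that have at least one valid subset type.
selected = {}
for gds_id in sorted(toy_info):
    types = valid(toy_info[gds_id])
    if types:
        selected[gds_id] = sorted(types)

print(selected)
```

Only the ``tissue`` type of the first toy data set passes here: its ``strain`` type is rejected because one subset has just 10 samples, and the second data set has no subset with 40 or more.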
     124Let us now pick data set GDS2960 and see if we can predict the disease 
     125state. We will use logistic regression, and within 10-fold cross 
     126validation measure AUC, the area under the ROC curve. AUC is the probability of 
     127correctly distinguishing between two classes when picking one sample from the 
     128target class (e.g., the disease) and one from the non-target class (e.g., control) 
     129(:download:`geo_gds6.py <code/geo_gds6.py>`). 
    124130 
     131.. literalinclude:: code/geo_gds6.py 
     132 
     133The output of this script is:: 
     134     
     135    Samples: 101, Genes: 4068 
     136    AUC = 0.960 
     137 
     138The AUC for this data set is very high, indicating that using these gene 
     139expression data it is almost trivial to separate the two classes. 
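The pairwise reading of AUC used in the text can be checked by hand on a small example. The sketch below is an illustration in plain Python, not the code from ``geo_gds6.py``; the function name ``pairwise_auc`` and the toy scores are hypothetical. It counts, over all (target, non-target) pairs, how often the target sample receives the higher score, with ties counted as one half.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (target, non-target) pairs ranked correctly.

    `labels` holds 1 for the target class and 0 for the non-target class;
    `scores` holds the predicted score for the target class.
    """
    target = [s for s, y in zip(scores, labels) if y == 1]
    other = [s for s, y in zip(scores, labels) if y == 0]
    if not target or not other:
        raise ValueError("both classes must be present")

    wins = 0.0
    for t in target:
        for o in other:
            if t > o:
                wins += 1.0      # pair ranked correctly
            elif t == o:
                wins += 0.5      # tie counts as half
    return wins / (len(target) * len(other))


# Toy scores: three target samples followed by three non-target samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc = pairwise_auc(scores, labels)
print("AUC = %.3f" % auc)
```

Eight of the nine pairs are ranked correctly in this toy case (only the target sample scored 0.35 falls below the non-target sample scored 0.6), so the function returns 8/9.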
  • orangecontrib/bio/obiGEO.py

    r1936 r1938  
    9494    <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/>`_.  The compressed 
    9595    data file resides in the cache directory after the call of the 
    96     constructor (call to ``orngServerFiles.localpath("GEO")`` reveals 
     96    constructor (call to ``Orange.utils.serverfiles.localpath("GEO")`` reveals 
    9797    the path of this directory). 
    9898 