SiNoPsis v1.0  About  Home

TABLE OF CONTENTS:

  1. Query Form:
    1. Genomic Locus*
    2. Master SNP
    3. Cell Lines
    4. Variation
  2. Datasets and Tracks Description:
    1. Application Flow Chart
    2. Databases from External Sources
    3. Transcription Factors Track
    4. Annotation Track
    5. Regulation Track
    6. Histone Marks Track
    7. Chromatin State Track
    8. SNPs Track
  3. Color Codes Description:
    1. Tracks Color Legend
    2. Overlapping Genes
    3. Close-up View
  4. CRE-Region Description
  5. Usage Example: DDIT4 Gene
    1. Setting Genomic Locus
    2. Providing Master SNP
    3. Applying Cell Filters
    4. Choosing Variation
    5. Results Overview
  6. SNPs Example Queries:
    1. rs6755102 (GAD1 gene)
    2. rs456998 (FCHSD1 gene)
    3. rs33925946 (AKT1 gene)
*All coordinates were taken from the GRCh37/hg19 Human Genome Assembly


top
1. Query Form

Here you can find a brief description of the input form. It is basically divided into four distinct panels: "Genomic Locus", "Master SNP", "Cell lines" and "Variation". Question mark buttons all over the form, even the panel section headers, provide the user with short descriptions about each of the input fields on hovering those question marks with the mouse pointer. Those help items are depicted as  
?
 ; you can try to hover on this icon to see the effect.

The below image defines the letter codes that link each input field from the web form to the corresponding descriptions available on the following tutorial subsections :


1.1 Genomic Locus


The first panel handles the input coordinates and thus it will define the start and end positions where the analysis will be performed. This length comes from either the coordinates stored for the genes or the genomic coordinates themselves. Genic coordinates are stored based on identifiers, either set as Gene symbols, RefSeq IDs, or Ensembl IDs. Moreover, the final length considered for the dowsntream analyses is computed by adding the flanking-region lengths to the selected genomic locus. However, due to speed and transfer limitations, the maximum length that can be defined anywhere on the form is 350kbp.

A

The form can take standard gene names (Gene Symbol) and gene identifiers from RefSeq and Ensembl.

The general format for gene names is the following:

Gene -> Official Gene Symbol (not aliases)
RefSeq -> NM_nnnnnn
Ensembl -> ENSGnnnnnnn.

B

As already mentioned, searches can be performed by genomic coordinates as well, taking into account that user provides valid coordinates from GRCh37/hg19 human genome assembly. We remind that the maximum length allowed is 350kbp, due to speed and transfer limitations.

The coordinates must conform the following pattern:    chr{A}:{start}-{end}
For instance, here we select the segment between position 79,349,683 and 82,346,281 of chromosome 4:    chr4:79349683-82346281

C

Last input field of the panel allows user to change the size of the flanking regions. Those flanks correspond to a genomic segment upstream/downstream the selected genomic loci (either if you provide a gene name or genomic coords). Just consider that the limit is 20kbp, both upstream and dowstream.

User can provide two positive integer values, both greater than 0 and separated by the slash symbol, as shown here: {upstream}/{downstream} (i.e. 3/5).
The first number redefines the upstream region (this depends on gene strand), while the second provides the downstream region length. when user provides coordinates or master SNP only, not gene names or identifiers, the strand defaults to forward ("+"). When gene names or identifiers are provided by user, then the corresponding genomic annotation strand is retrieved along with the start and end coords. When this field is set to 0/0, the program will add a minimum of 1kbp to both ends, upstream and downstream, just to ensure proper spacing on the final drawings.

1.2 Master SNP


D

The Master SNP corresponds to a SNP for which the user already knows whether it has functional annotated evidences or that has been found statistically significant from another experiment. By providing this SNP the user can get additional linkage disequilibrium (LD) information with respect to all the SNPs mapped over the genomic region appearing on the final genomic selected region. Such LD is computed for each single SNP on that region and the master SNP. That way the user can also get additional SNP candidates that are linked to the one provided as master; this can assist in the design of further downstream experiments. In the results table, both HTML or PDF versions, this master SNP will be prefixed by "**".

There are two types of analyses that can be done with a master SNP:

A) Search a gene/RefSeq/Ensembl or by coordinates, together with this SNP. It is important to mention that the user must know that the given SNP is within the coordinates given for the genomic location.

B) The master SNP itself, by leaving the coordinates and gene fields empty. By default, the web tool adds then 2kbp to the SNP genomic position, both upstream and downstream. Moreover, a flanking region can be also summed up to those new coordinates. Again, in order to define another flanking length, gene and genomic locus form fields must remain empty, otherwise the program will consider that SNP was part of the genic locus instead of an independent feature.

1.3 Cell Lines


By default, if no cell line is selected on the corresponding panel of the input form, the search is made over all cell lines, and those falling inside the genomic segment selected will considered and drawn. On the opposite, if only one cell line is selected, the marks present in the given genomic locus will be drawn as one thick ribbon, in fact representing only marks relevant for that cell line (see DDIT4 example results). In the form, those cell lines lacking annotated features for that marks track are highlighted in italics and a gray color, to make the cell lines present in each track more readable. Multiple cell lines selection (default) has the advantage that the resulting figure can give visual clues about the chromatin state at the region of the selected SNPs, for instance.

E


This track will filter data for cell lines having information regarding histone marks relevant to chromatin state: inactivation (H3K27me3, H3K9me3), and the oposite effect, activation (open chromatin).

F

This track will filter data for cell lines for which regulatory information is available, corresponding to promoters, promoter flanking regions, and enhancers.

G

This track will filter data for cell lines that have information about different histone marks, two of them found "abundant" at promoter sites (H3K9ac, H3K4me3) and the other two at enhancer sites (H3K4me1, H3K27ac).

1.4 Variation


H

This form field shows the available tissues for which an eQTLs can be recorded in the database.

I


Last form field allows user to choose SNPs annotated from different human populations. At this moment those are the populations included in our database:

EUR (European)
AFR (African)
EAS (East Asian)
AMR (Ad Mixed American)
SAS (South Asian).



top

2. Datasets and Tracks Description

This section describes the resources gathered to build the underlying database behind the web application. It starts with a diagram of the data sources and the workflow of the web app, together with the description of each of the tracks represented in the resulting figures.



2.1 Application Flow Chart


The web application integrates information from different databases, as shown in the below chart. Those resources are pre-processed into MySQL tables by Perl scripts. On the web applicaation the information gathered from the users' input form is retrieved by PHP scripts, which use the provided parameters to select and filter out specific data from the local database tables and then store into temporary files to help in the making of the image maps. Finally, those PHP scripts also integrate those images and supplementary files into the different results summaries (PNG, HTML, PDF and ZIP files).







2.2 Data from External Sources


The following table summarizes all the information about the sources of the different marks, binding sites, SNPs, and eQTLs integrated in the web application MySQL database. This includes used database versions and links to the original resources. Some of the links point to a whole folder (like Promoters, Promoter Flanking, ...), because we downloaded the files from that folder. Links to UCSC refer to its "Table Browser" utility, further details of the queries can be found in the corresponding subsections below. The eQTL info is dinamically retrieved from GTEX site; now that site requires that users must register to acces the data, thus we provide the link to a tarball synchronized with this version.



DatasetOriginal SourceExperiment TypeVersion
TFUCSCChIP-seq, Conserved AlignmentVer. 2, hg19
GenesHGNCDNA-seq20170623
ExonsUCSCRNA-seqRelease 81
Promoters, Promoter Flanking, EnhancersEnsemblChIP-seqRelease 84
Histones H3K27ac, H3K4me1, H3K4me3, H3K9ac, H3K27me3, H3K9me3 ChIP-seqRelease 84, v2
ChromatinEnsemblChIP-seqRelease 84
SNP1000GENOMESWhole Genome Seq.Phase 3
eQTLGTExRNA-seq, Whole Exome Seq., Whole Genome Seq.Ver. 6
TSSFANTOM5CAGEPhase 1.3

2.3 Transcription Factors Track


The information was retrieved from the UCSC Table browser utility from the UCSC database. “Regulation group” table of the “TFBS Conserved” and “Txn Fac ChIP V2” tracks were selected. We cover 257 TFs having a total of 9,411,620 binding sites over human genome hg19.

Orange ribbons correspond to TF binding sites on the final plots (see graphical legend), whereas turquoise ones define CTCF-binding sites; overlapping TFBS are remarked as grey ribbons. We thought to highlight CTCF over other TF because it can function as transcriptional activator, repressor, or as insulator protein, blocking in that case the communication between enhancers and promoters. CTCFs can also recruit other transcription factors while bound to chromatin domain limits. The three-dimensional organization of the eukaryotic genome dictates their function, and CTCF serves as one of the core architectural proteins that help to establish this organization. The mapping of CTCF-binding sites in diverse species has revealed that the genome is covered with CTCF-binding sites.


ChromosomeNumber of Binding Sites
1784435
2710552
3560073
4404802
5495943
6502298
7438396
8370319
9349948
10386739
11428152
12386399
13209853
14274567
15274662
16294226
17360421
18181535
19246426
20226063
2187308
22131166
X279852
Y5340

Next table shows each TF and the number of binding sites over the genome that are annotated for it.


ChromosomeNumber of Binding Sites
AHR23478
ALX156385
AP2A110107
ARID3A22255
ARID5B22590
ARNT42067
ATF122802
ATF266606
ATF322794
ATF620668
BACH142523
BACH218193
BATF31954
BCL11A19794
BCL322386
BCLAF18537
BDP1737
BHLHE4036362
BPTF26210
BRCA18440
BRF1254
BRF21342
CBX319423
CCNT216249
CDC5L44001
CEBPA103440
CEBPB184728
CEBPD10530
CHD116822
CHD231082
CP1A18219
CP1C8730
CREB31539
CREB142525
CTBP26266
CTCF161575
CTCFL11122
CUX1136507
DDIT320571
E2F140164
E2F27941
E2F37941
E2F425175
E2F57941
E2F624243
EBF148017
EGR155056
EGR24926
EGR31692
ELF143411
ELK125509
ELK45172
EP300150579
ESR147882
ESRRA1036
ETS113385
EZH214700
FAM48A4055
FOS157707
FOSL129793
FOSL242859
FOXA189059
FOXA239196
FOXC135418
FOXD124218
FOXD345032
FOXF229636
FOXI145016
FOXJ2109231
FOXL144627
FOXM120902
FOXO155708
FOXO322019
FOXO3B22019
FOXO440964
FOXP225726
GABPA26297
GATA1125524
GATA2106261
GATA395466
GATA66323
GRp20471
GTF2B2481
GTF2F113271
GTF3C23692
HDAC110245
HDAC231678
HDAC61097
HDAC81502
HERPUD112182
HLF30168
HMGN311736
HNF1A69689
HNF4A56950
HNF4G19117
HOXA933827
HSF118314
HSF210242
IKZF17458
IRF146184
IRF31500
IRF416993
IRF728489
IRF924367
JUN119025
JUNB29262
JUND112358
KAP125781
KDM5A1571
KDM5B12680
KLF1212471
LHX3A54930
LHX3B54930
LMO233098
MAFF44694
MAFK81783
MAX95072
MAX128821
MAZ40806
MBD45330
MEF2A166829
MEF2C8716
MEIS157859
MTA310730
MXI133344
MYB15341
MYBL215385
MYC126480
MYOD113287
MZF121871
NANOG5274
NCX16036
NF136089
NFATC139949
NFATC230087
NFATC330087
NFATC430087
NFE221820
NFE2L169938
NFIC37762
NFIL335377
NFKB28217
NFKB133841
NFKB213651
NFYA28513
NFYB16639
NKX2-233653
NKX3-144631
NKX6-151513
NR2C24168
NR2F237173
NR3C178186
NRF17177
Oct-B235855
Oct-B335855
PATZ19743
PAX229477
PAX559345
PAX633212
PBX161343
PBX39564
PHF816954
PITX220117
PML22414
POLR2A134379
POLR3G209
POU2F1298081
POU2F260952
POU2F2B35855
POU2F2C35855
POU3F125648
POU3F2142788
POU5F13876
PPARG76764
PPARGC1A1185
PRDM14335
RAD21116361
RBBP519016
RCOR146432
RDBP387
REL12613
RELA45766
REST75114
RFX133259
RFX522831
RORA59296
RPC1552735
RREB115420
RSRFC451793
RUNX143595
RUNX363799
RXRA18355
SAP308184
SETDB121799
SIN3A21801
SIN3AK2036564
SIRT61976
SIX57629
SMARCA43269
SMARCB17975
SMARCC17312
SMARCC22355
SMC352888
SOX931813
SP158374
SP25150
SP45073
SPI166266
SREBF129794
SREBP132041
SRF64713
SRY87826
STAT152871
STAT218486
STAT397578
STAT421489
STAT5A56012
STAT5B12510
STAT617151
SUZ125688
TAF139404
TAF710811
TAL152829
TAL1B15821
TBL1XR119804
TBP134233
TCF1242578
TCF3106131
TCF46258
TCF7L242979
TEAD455961
TFAP2A15950
TFAP2C33978
TFAP431307
TFIID35107
TGIF126564
THAP13107
TOPORS20947
TP5337139
TRIM2811116
UBTF12402
USF181641
USF222050
VSX242000
WRNIP112796
XBP115560
YY195933
ZBTB3312204
ZBTB615177
ZBTB7A23808
ZEB170862
ZIC25082
ZKSCAN13655
ZNF14341372
ZNF2178903
ZNF26326732
ZNF2741837
ZZZ3803

2.4 Annotation Track


Light green ribbons represent the length of genomic locus defined by the user on the first panel of the web form; the query segment is defined by a pair of coords delimiting the gene (from its standard gene symbol, RefSeq or Ensembl identifiers) or the coordinates provided by the user. Orange and red ribbons refer to the 5'–flanking and 3'–flanking respectively. Blue ribbons represent a gene identified inside the input query (if exons are defined in the gene structure then this is the fill color for introns). Finally, dark green boxes inside the gene span represent exons. Gene names and Ensembl identifiers were retrieved from HGNC. Exon information and RefSeq identifiers were retrieved from UCSC Table Browser, from the “Genes and Gene Predictions” table of the “RefSeq Genes” track. We include genic structure annotations for 18,918 gene loci.


ChromosomeNumber of Genes
11983
21188
31029
4727
5785
61000
7862
8651
9742
10707
111250
12994
13308
14585
15559
16798
171125
18261
191363
20517
21213
22420
X804
Y46

2.5 Regulation Track


Data was retrieved from Ensembl Regulatory Database, release 84. Pink ribbons show promoter flanking regions, light purple ones depict enhancers, and beige ribbons correspond to promoters. We have included 100,483 promoter flanking regions sites, 140,349 enhancer sites, and 20,954 promoter sites.


ChromosomeEnhancersPromotersPromoter Flanking
1 13947 2126 9332
2 12526 1419 8969
3 9575 1137 7124
4 7120 781 5239
5 8637 940 6174
6 8603 1029 6434
7 7484 1010 5150
8 6643 769 4954
9 6233 874 4421
10 6663 807 5057
11 6385 1158 4913
12 6930 1088 4969
13 3849 354 2757
14 3922 665 3031
15 6221 741 3459
16 3859 947 2945
17 5074 1270 3574
18 3258 362 2375
19 2290 1487 1917
20 3492 530 2593
21 1738 248 1244
22 2109 521 1657
X 3568 662 119
Y 223 29 2076

2.6 Histone Marks Track


Epigenetics data was downloaded from ENCODE. Different histone marks were mapped in order to show which modifications could reaffirm the promoter and enhancer sites mapped from the Ensembl database as functional. The histone modifications track includes: H3K9ac and H3K4me3, both correlated with promoter regions; and H3K4me1 and H3K27ac, both correlated with enhancer sites. Briefly, the H3K27ac summarizes 1,126,338 sites; H3K4me1, 1,516,873; H3K4me3, 420,127; and H3K9ac, 1,270,829 sites.

The post-translational histone modifications, H3K27ac and H3K4me1 are found to be representative of active enhancers. When they are both found in the same region and simultaneously in that region there is a mark for an enhancer, this can be considered as an evidence for that region to be transcriptionally active. Moreover, if a SNP disrupts both histone marks and the enhancer, this is a strong indicator that this SNP could affect the normal functioning of such a region. The same can be said about the H3K9ac/H3K4me3 pair, yet they are representative of active promoters. The number of blue and yellow marks represents the number of cell lines having those marks. By default, when no cell line is selected on the input form, then the search is made over all cell lines, and those having the mark will be drawn. On the opposite, if only one cell line is selected, if the mark is present in the given loci then it will be drawn as one thicker band, in fact representing only that cell. By setting the default web form parameter to search in all cell lines, the corresponding blue and yellow ribbons can provide a hint on the relative abundance of those marks. Again, filtering by cell line in the histone track will end up showing the marks that the cell line has and the corresponding ribbons will be thicker just to facilitate its visualization.


ChromosomeH3K27acH3K4me1H3K4me3H3K9ac
1 140012 171735 45193 156093
2 90621 126351 30558 100743
3 77773 108356 24286 85239
4 58391 83143 16860 64870
5 62125 86460 20062 68668
6 72118 97625 20782 78150
7 61877 85020 21394 73772
8 52097 72900 17617 59911
9 49842 67468 16154 57638
10 56975 78938 16266 60599
11 50031 68394 24423 59678
12 49370 67646 21898 57829
13 31216 44968 7816 32465
14 29645 38891 14031 32020
15 31275 42353 15343 37166
16 30270 40828 17005 36627
17 38362 48927 24405 49699
18 24928 35354 7713 26729
19 20690 25124 22219 32200
20 29749 38106 10579 30429
21 14055 18850 5381 15195
22 18069 23366 10075 21751
X 35521 44703 9734 905
Y 1326 1367 333 32453

2.7 Chromatin State Track


Data was obtained from Ensembl. In this track "open chromatin" feature is highlighted in dark green, H3K27me3 in purple, and H3K9me3 in light blue. The latter histone mark is related to heterochromatin yet it contains much less marks annotated; whilst the former is related to inactivation signals of the chromatin. Thus, we have considered appropriate to combine them in the same track. In sumary, there are 938,294 open chromatin sites, 3,236,553 H3K27me3 sites, and 11,882 H3K9me3 sites uploaded in our database.


ChromosomeH3K27me3H3K9me3Open Chromatin
1 299788 846 92657
2 275083 635 73012
3 192546 291 56690
4 135738 470 39113
5 188860 371 49153
6 181160 350 53933
7 184086 759 47741
8 176459 516 40085
9 130376 463 41224
10 170099 788 43946
11 177342 407 48032
12 156321 372 46187
13 80754 366 20305
14 99326 144 28244
15 110496 192 31924
16 113791 522 35960
17 123978 174 47024
18 79235 150 17826
19 66617 995 42471
20 123032 215 25830
21 32064 476 11614
22 53711 252 19845
X 71171 208 19871
Y 5765 644 1738

2.8 SNPs Track


Data was retrieved from the 1000GENOMES project. The track shows the SNPs that were mapped onto the genomic coordinates, mainly providing their location and dbSNP identifier. If a “master SNP” was provided in the initial form, the SNP label also contains the computed linkage disequilibrium (LD), the values are always shown within parentheses after the SNP identifier. Those cases without LD will show a "0" symbol inside the parentheses following the SNP identifier. SNPs remarked in green are also eQTLs, they are genomic loci that contribute to variation in expression levels of mRNAs. The “master SNP” from the initial form field will be shown in red. The current database covers 9,477,952 SNPs with MAF > 0,1 and 1,984,754 eQTLs.


ChromosomeSNPseQTLs
1 712252 183734
2 759768 159737
3 653605 126005
4 673897 111993
5 567962 110192
6 602491 161493
7 539563 118313
8 501332 85871
9 397515 75123
10 464782 97561
11 451561 108099
12 446102 100548
13 333915 49678
14 301404 62054
15 271720 71858
16 287229 66401
17 254629 81116
18 260751 34506
19 217232 86763
20 198537 42452
21 133947 24501
22 128945 38407
X 348566 ---
Y 62010 ---



top

3. Color Codes Description

Each genomic feature depicted in the resulting plots has a distinct color assigned to it in order to facilitate its visualization. You can find below the graphical legend where the colors and the relative position of each feature type can be found in the final plots. Furthermore, just a couple of subsections ahead, there is an annotated image that illustrates the placement of each feature in those final plots, taking as example AKT1 gene results.



3.2 Tracks Color Legend


The graphical legend summarizes the different tracks in the same order they appear in the results figures, starting from the left as the most inner track. For each track, the colors are annotated along with the genomic features represented in the plot. The graphical legend is embed in the HTML and PDF versions of the graphical results produced by this web application.




3.3 Overlapping Genes


The figure below represents the results obtained after searching the FCHSD1 gene. As it can be seen, there are three additional genes in the query region: RELL2, HDAC3 and ARAP3. Each gene has its exons (dark green) and introns (blue) drawn, and to be able to differentiate between overlapping exon region in this track each overlapping gene will occupy a higher subtrack region (RELL2 in this figure). Those genes that are also in that region, but are not overlapping, will be drrawn in the outermost part of this track (HDAC3, ARAP3 and FCHSD1).




3.1 Close-up View


Initial Query:

- Gene: AKT1
- Master SNP: rs1130214
- Flanking: 10/10
- Cells: ALL
- eQTL: ALL
- SNP Population: ALL

Features that can be appreciated in the figure below:

- SNPs line beneath the tracks facilitating the visualization of the SNP's possible relationship/effect on the different structures (in case it disrupts the other features).
- Master SNP is colored red (find it on the lower right corner of the figure).
- eQTLs are colored green and each SNP has in parenthesis its linkage desequilibrium (LD) with the master SNP, or 0 if it does not have LD.
- Genes are colored blue (introns) with dark green as exons on top. The name of the gene includes the strand direction within the parenheses ("+" meaning forward strand, "-" for reverse).
- 3'-flank is shown in red and 5' flank in orange. If the input is a gene name/Ensembl/RefSeq IDs and it is in reverse strand (-), the 5'-flank will always be on the right side of the figure, whilest if it is in the forward strand (+), the 5'-flank will appear on the left side; the selected region (in light green) always fall on the bottom half of the plot (just below the main AKT1 gene in this example). Selected regions from chromosome coordinates and direct SNP queries are set in forward strand and will have the 5'-flank on the left.
- Each 1kbp of the input has an orange tick indicating the current position (GRCh37/hg19).
Note from this figure zoom, that the pair H3K9ac and H3K4me3 are more abundant in promoters-rich regions, and the pair H3K27ac and H3K4me1 in enhancers or promoter-flanking regions (they include enhancers).


In the figure below, we used the same initial query as in the previous case, but we also filtered the Chromatin, Regulation and Histone Tracks by the A549 cell line. We can see that this cell line presents same marks as in the previous figure, but here they are thicker.





top

4. CRE-Region Description

Internally we distinguish between different CRE region types, as depicted in the figure below. A CRE region can be defined as a feature disrupting histone marks (either the ones correlated with promoters or enhancers), open chromatin and transcription factors, or histone and open chromatin, or transcription factor and open chromatin. The labels assigned and the criteria are also summarized in a table.






top

5. Usage Example: DDIT4 Gene

For this use case, we are going to play with gene DDIT4, located in chromosome 10 between 74,033,677-74,035,797bp (GRCh37/hg19). As master SNP we are going to use rs1053639, because in previous experiments we performed it was statistically significant (Mas S et al, 2015). Thus, by providing this SNP we will retrieve the linkage desequilibrium among it and all the other SNPs falling within DDIT4 genomic region.

5.1 Setting Genomic Locus


We searched by gene name (DDIT4), but the form also accepts RefSeq IDs (NM_019058 for our query gene), and Ensembl IDs (ENSG00000168209 in this case).


Similar result can be obtained by providing the genomic coordinates of this gene (for this example, this will be chr10:74033677-74035797). Please note that you cannot input a gene identifier (either a symbol, a RefSeq ID or Ensembl ID), simultaneously with a genomic region. It has to be one OR the other. The web application will always assume that the genomic coordinates define the sequence in the forward/positive strand, so that please ensure you provide them in that order.


The final step in the Genomic Locus panel can be to define lengths for the flanking sequences. The first number represents the upstream and the second one, the downstream region; for instance 7/5 means it will add 7kb upstream and 5kb downstream to the coordinates of our query genomic region. When retrieving the coordinates from the gene identifier in forward coordinates (start always smaller than end), thus depending on the gene annotated strand the upstream region is located at the starting position of the genomic region (forward genes), or at the end position of that region (reverse genes). We are going to provide for this example 10/10 as the upstream/downstream flanking distances (default are 5/5 kbp).


5.2 Providing Master SNP


As master SNP we will use rs1053639, because in our previous experiments we found it as statistically significant (Mas S et al, 2015). Represented in the below figures you can find two different ways to provide the query for the same genomic region: "gene + flanking + master SNP" or "genomic coordinates + flanking + master SNP". Both of those two queries will produce same final results.


The web application also permits to input a "solo" master SNP, but then gene and genomic coordinates form fields MUST be empty. By default, given the position of the SNP introduced the application adds 2kb upstream and downstream of it (as the SNPs themselves are defined as a very short genomic segment, a single nucleotide for a nucleotide substitution), plus 5 kb as flanking regions in 5' and 3'. User can customize this too, as it has been shown for gene queries before.


5.3 Applying Cell Filters


Next step will be to select a cell line of interest for each of the tracks provided in the third panel of the input form (Chromatin State, Regulation, and Histone Marks). On each track, cell lines that are missing are marked in gray color and italic font. For example, the cell line A549 that is present in all tracks will produce as a result a more robust image. Note that if there is no information for a given set of cell lines (or for that genomic segment) the correpsonding tracks will remain empty in the final figure. By default the form will select all the cell lines. This can give the user a broad idea of where the promoter, histone marks, and chromatin features are located. Filtering by a specific cell line then, will give her clues about what features can be relevant for the regulation of the gene in that condition.


5.4 Choosing Variation


The last panel of the input form allows filtering by eQTL tissue and population. For the current example, we will filter now by Thyroid tissue and EUR population.


5.5 Results Overview


For this guided example, the web application mapped 36 SNPs within the query's coordinates, 20 of which are eQTLs. Furthermore, those can be subclassified in the following types: 5 ecreSNP, 2 creSNP, 15 eSNP, and 14 normSNP.
The master SNP is located in the last exon of the gene, inside a promoter mark and in between multiple TF binding sites including YY1 and POLR2A. The former can either activate or repress gene expression and the latter is part of the subunit alpha of the RNA pol II (see labeled circle D highlight on the zoomed image).




If we zoom in, and focus on the master SNP (A) of this analysis, we can see that it has other SNPs with a high LD and most of them are also eQTLs. From those SNPs, rs4747242 (B) could be a new candidate for a functional study. It has high LD with the master SNP, it is an eQTL, disrupts the histone marks associated with promoter marks (C), and also disrupts TFBS of YY1 and POLR2A (D).


top

6. SNPs Example Queries

In this last section, the examples will focus at three SNPs from different genes, looking for clues given from the resulting images and the classification table (either from the HTML or the PDF report versions).



6.1 rs6755102 (GAD1 gene)

This SNP is located in the 5'-UTR of the GAD1 gene and in an extremely repressed promoter as reported by Mitchell et al, 2015, specially in schizophrenic patients. Moreover, the same repression can be seen in the resulting image, indicated by the dark violet bands in the second to last track.

rs6755102 has also been reported as a functional SNP located in the GAD1 gene by Du et al, 2008, and forming an haplotype with other two SNPs (rs3762556 and rs3791878). The classification table resulting from searching the GAD1 gene and rs6755102 as master SNP, classifies these SNPs as ecreSNP (highest level) for the former and creSNP for the latter. Finally, SiNoPsis also classifies rs6755102 as creSNP, disrupting both H3K9ac and H3K4me3 (histone marks correlated with promoters), the promoter mark in the Regulation Track, and 20 transcription factors, including POLR2A and CTCF as it can be seen in the figure below highlighted by the red line.

Knowing that rs6755102 is a functional SNP, a researcher can use our web application and get more SNP candidates searching this as master SNP as we did in this query. Based on the results obtained in this query, the next SNPs could be interesting to include in a functional study based on their classification and linkage desequilibrium: rs3762560 (ecreSNP, 0.6950); rs3791875(ecreSNP, 0.7246); rs3791879 (ecreSNP, 0.6979); and rs4668324 (ecreSNP, 0.6786).

6.2 rs456998 (FCHSD1 gene)


Starting from our previous study of a predictor of extrapyramidal symptoms induced by antipsychotics (Mas et al, 2015), the rs456998 (A in the figure below) variant was found to have strong interaction will others three SNPs from other genes in the same mTOR pathway. To improve the predictor it will need to add more variants or validate the ones found. In order to do that, as a first analysis, one can use SiNoPsis to get ideas of new variants related to the one found. We performed the query with the FCHSD1, 10/10 flanking, and with rs456998 as master SNP.

As it can be seen in the figure below, we get additional clues about two other SNPs:

rs1421896 (B in the figure below) is depicted inside an open chromatin region, promoter mark and histones marks correlated with promoters as well as inside multiple TF binding sites, near he first exon of the HDAC3 gene. Given that has LD with our master SNP, and the gene it could affect has a wide range of effects, it makes us think that maybe the interaction that we discovered from our master SNP could be due to the effects that rs1421896 has on HDAC3 to continue our studies it could be interesting to include it in a next downstream experiment as this SNP could affect the expression of FCHSD1 gene directly of through modification of the HDAC3 gene which indirectly could modify FCHSD1.

rs12655779 (C in the figure below) only disrupts an enhancer region and its an eQTL. This SNP could affect the promoter region of the FCHSD1 gene in two ways, one through 3D remodeling of the chromatin of the enhancer region and another through the eQTL property of the SNP.



6.3 rs33925946


If we find ourselves without a master SNP, because the study is new and there are no functional SNPs studied until now, we could also use the web application to start including possible functional SNPs into experimental study. We introduced the AKT1 gene from the mTOR pathway, with 10/10 flanking regions. At first sight, we can appreciate that there are two more genes in our query that are transcribed from the + strand, unlike AKT1, from the - strand. Also, there are three promoter marks in multiple cells, two of them belonging to AKT1 promoter and one to the ZBTB42 gene.

This is a list with the following SNPs that could be introduced in an experimental study, given that they disrupt the histone marks correlated with promoters and the proper promoter mark plus TFBS and some are eQTLs:

- rs1130233: ecreSNP (A)
- rs74090038: creSNP (D)
- rs2494750: creSNP (D)
- rs10138227: ecreSNP (E)

The following SNPs could be selected based on they position inside histone marks correlated with enhancers and enhancer regions:

- rs73364507: creSNP (B). This SNP is inside an enhancer mark, but it does not disrupt the histone marks, but is 500bp away from said marks.
- rs33925946: creSNP (C). This SNP disrupts the histone marks, and a promoter flanking region, considered to have enhancer regions inside. This could be interesting to see if affects the expression of AKT1, given that it has all the characteristics to be inside an enhancer region.




top