SHOGoiN scMontage
Downloads


Installation


On Linux, run the following commands:
$ ./configure
$ make
$ make install

By default, "make install" installs the files under "/usr/local" (in "/usr/local/bin", "/usr/local/lib", etc.).
To install under a different prefix, pass "--prefix" to "configure", for instance "--prefix=$HOME":
$ ./configure --prefix=$HOME
Running

Prepare a database file by simply concatenating gene expression files in CM format, as follows:
db.CM
>GSM1269135 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
ENSG00000000003:469 ENSG00000000005:0 ENSG00000000419:0 ...
...
ENSG00000283122:0 ENSG00000283123:8 ENSG00000283125:0
>GSM1269137 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
ENSG00000000003:237 ENSG00000000005:0 ENSG00000000419:45 ...
...
ENSG00000283122:0 ENSG00000283123:0 ENSG00000283125:0
>GSM1269130 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
...
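
To make the layout above concrete, here is a minimal Python sketch that reads CM-format text. It assumes only what the example shows: each record starts with a header line beginning with ">", followed by whitespace-separated "geneID:value" tokens. The function name "parse_cm" is hypothetical and is not part of scMontage.

```python
import io


def parse_cm(handle):
    """Yield (header, {gene_id: value}) for each record of a CM-format stream.

    Assumption: a record is a '>' header line followed by one or more lines
    of whitespace-separated 'geneID:value' tokens, as in the example above.
    """
    header, values = None, {}
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, values
            header, values = line[1:].strip(), {}
        elif header is not None:
            for token in line.split():
                gene_id, sep, value = token.rpartition(":")
                if sep:  # ignore anything that is not 'geneID:value'
                    values[gene_id] = float(value)
    if header is not None:
        yield header, values


# Two shortened records taken from the example above.
example = io.StringIO(
    ">GSM1269135 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))\n"
    "ENSG00000000003:469 ENSG00000000005:0 ENSG00000000419:0\n"
    ">GSM1269137 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))\n"
    "ENSG00000000003:237 ENSG00000000005:0 ENSG00000000419:45\n"
)
records = list(parse_cm(example))
for header, values in records:
    sample_id = header.split("|")[0].strip()  # first "|"-delimited field
    print(sample_id, len(values))
```

The sketch is meant for inspecting or sanity-checking small files before indexing, not as a replacement for the tools below.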
Generate the index file, "db_geneIds.txt":
$ ./genIndex.pl db.CM | sort | uniq > db_geneIds.txt
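
The source of "genIndex.pl" is not shown here, so the following Python sketch is only a guess at the net effect of the pipeline above: it assumes the result is one unique gene ID per line, sorted. The function name "unique_gene_ids" is hypothetical.

```python
def unique_gene_ids(lines):
    """Collect the sorted set of gene IDs found in CM-format lines.

    Assumption: header lines start with '>' and data lines hold
    whitespace-separated 'geneID:value' tokens.
    """
    ids = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(">"):
            continue
        for token in line.split():
            gene_id, sep, _ = token.rpartition(":")
            if sep:
                ids.add(gene_id)
    # Note: Python sorts by code point; the shell "sort" order depends on locale.
    return sorted(ids)


cm_lines = [
    ">GSM1269135 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))",
    "ENSG00000000003:469 ENSG00000000005:0 ENSG00000000419:0",
    ">GSM1269137 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))",
    "ENSG00000000003:237 ENSG00000000005:0 ENSG00000000419:45",
]
ids = unique_gene_ids(cm_lines)
print("\n".join(ids))
```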
Generate the binary database file, "db.bin":
$ ./runGerIndexer db_geneIds.txt < db.CM > db.bin

Prepare a query file in CM format and run the scMontage profile matcher as follows:
$ ./runGerMatcher db.bin query.CM > result.txt

The following is an example of using the scMontage profile matcher.
The database file "HiSeqHsapiens.bin" and the query file "query.GSM1901473.TF_activity_protein_binding.CM" can be downloaded from scMontage_Database. The query file contains "MF: transcription factor activity, protein binding" genes. The profile matching is performed using only the genes included in both the database and the query.
$ ./runGerMatcher HiSeqHsapiens.bin query.GSM1901473.TF_activity_protein_binding.CM > result.txt

The search result (using scMontage database version 1.0.1 in August 2018) is shown in the following table consisting of five columns:
· 1st column: Sample ID. The scMontage database files use the sample accession numbers (GSM) of the NCBI Gene Expression Omnibus (GEO).
· 2nd column: P-value of Fisher's Z-transformed rank correlation coefficient. The derivation of the p-values is described in Document manuals.
· 3rd column: Spearman's rank correlation coefficient. The derivation of the correlation coefficients is also described in Document manuals.
· 4th column: The number of genes used for the profile matching.
· 5th column: Header information of the CM format in the database file. In the scMontage database files, the header gives the GEO sample accession number (GSM), the GEO platform accession number (GPL), the organism, and the SHOGoiN Cell ID of the sample, delimited by "|".
Sample ID | P-value of Fisher's Z-transformed rank correlation coefficient | Spearman's rank correlation coefficient | # genes used for matching | Header information of CM format in database file |
---|---|---|---|---|
GSM1901473 | 0 | 1.00 | 588 | GSM1901473 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet)) |
GSM1901487 | 3.62266e-13 | 0.556442 | 588 | GSM1901487 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet)) |
GSM1901493 | 7.22888e-12 | 0.544295 | 588 | GSM1901493 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet)) |
GSM1901488 | 1.94848e-10 | 0.529727 | 588 | GSM1901488 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet)) |
GSM1901458 | 3.72947e-10 | 0.526685 | 588 | GSM1901458 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-212 (Pancreas, PP cell (Pancreatic islet)) |
GSM1901497 | 1.09923e-09 | 0.521478 | 588 | GSM1901497 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet)) |
GSM1901464 | 1.62965e-09 | 0.519536 | 588 | GSM1901464 |GPL11154 (HiSeq 2000)|H.sapiens|3110002050000000000000-090 (Pancreas, Duct cell (Pancreatic islet)) |
GSM1901519 | 3.53105e-09 | 0.515645 | 588 | GSM1901519 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-026 (Pancreas, Beta cell (Pancreatic islet)) |
... | ... | ... | ... | ... |
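
To illustrate what the correlation, p-value, and gene-count columns mean, here is a minimal Python sketch of a textbook computation: Spearman's rank correlation over the genes shared by a query and a sample, followed by a two-sided p-value from Fisher's Z-transform with the usual normal approximation (standard error 1/sqrt(n-3)). This is an assumption-laden illustration, not scMontage's implementation. The tool's own derivation is described in Document manuals, and this sketch should not be expected to reproduce the p-values in the table. The function names and the toy gene IDs ("g1", "gX", ...) are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(vx * vy)


def match_profiles(query, sample):
    """Return (rho, p_value, n_genes) using only genes present in both dicts."""
    shared = sorted(set(query) & set(sample))
    n = len(shared)
    if n <= 3:
        raise ValueError("need more than 3 shared genes for the Z-transform")
    # Spearman's rho is the Pearson correlation of the rank vectors.
    rho = pearson(average_ranks([query[g] for g in shared]),
                  average_ranks([sample[g] for g in shared]))
    if abs(rho) >= 1.0:
        return rho, 0.0, n  # atanh diverges at +/-1
    z = math.atanh(rho) * math.sqrt(n - 3)  # textbook normal approximation
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return rho, p_value, n


# Toy profiles: "gX" and "gY" are ignored because they are not shared.
query = {"g1": 10, "g2": 0, "g3": 5, "g4": 7, "g5": 1, "gX": 3}
sample = {"g1": 8, "g2": 1, "g3": 6, "g4": 2, "g5": 0, "gY": 4}
rho, p_value, n = match_profiles(query, sample)
print(f"rho={rho:.3f} p={p_value:.3g} genes={n}")
```

Restricting both profiles to the shared genes before ranking mirrors the statement above that matching uses only the genes included in both the database and the query.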