The core of current version of SIGNET is to use the two-stage penalized least squares (2SPLS) method proposed by Chen et al. (2018) to construct genome-wide gene regulatory networks (GW-GRNs). An application of 2SPLS to yeast data can be found in Chen et al. (2019).
While current SIGNET constructs the GW-GRN with transcriptomic and genotype data collected from on population, we are developing SIGNET to simultaneiously construct and compare GW-GRNs for two or more populations for the purpose of (i) more powerful to establish comment reguations shared across different populations; (ii) more effectively identify population-specific (e.g., cancer-specific) gene regulations.
SIGNET runs on a UNIX bash shell. Check your shell with echo $SHELL
to make sure that you are running on UNIX bash shell. SIGNET uses the Slurm Workload Manager for high performance computing (HPC) clusters in its stage of constructing the gene regulatory network in parallel.
First you should clone the directory to the path in your server and add the path where you install the software to enable directly running the command without specifying a particular path.
git clone https://github.itap.purdue.edu/jiang548/SIGNET.git
cd SIGNET
export PATH=/path/to/signet:$PATH
where /path/to/signet
should be replaced with your path to SIGNET.
SIGNET runs dependent on several packages such as PLINK, IMPUTE2, and R (with its libraries). While you may install all of these packages by yourself, we also provide a Singularity container signet.sif
which packs all the packages required by SIGNET. The Singularity container signet.sif
provides an environment in which SIGNET can smoothly run, so you don't have to separately install any of the required packages for SIGNET.
Before having the Singularity container signet.sif
, first you have to install Singularity following https://sylabs.io/guides/3.8/user-guide/quick_start.html#quick-installation-steps.
You can pull the image from our repository and rename it as signet.sif
, after which you can append the path of package to singularity so it can execute SIGNET smoothly. You may also need to bind a path in case container doesn't recognize your file. The environment variables have to be exported everytime you start a new terminal.
singularity pull library://geomeday/signet/signet:0.0.5
export SINGULARITYENV_APPEND_PATH="/path/to/signet"
export SINGULARITY_BIND="/path/to/bind"
where /path/to/signet
should be replaced with your path to SIGNET, and /path/to/bind
should be replaced with the desired bath to bind.
You can use the image by attaching a prefix ahead of the original commands you want to execute, which are described in details in sections below.
singularity exec signet.sif [Command]
e.g.
singularity exec signet.sif signet -s
Or you could first shell into the container by
singularity shell signet.sif
and then execute all the commands as usual.
Caution: All the intermediate result for each step will by default return to the corresponding folders in the tmporary directory starting with 'tmp' and all the final result will return to the result folders starting with 'res'. You could also change the path of result files in the configuration file named config.ini, or use signet -s described below. Please be careful if you are using the relative path instead of the absolute path. The config.ini will record the path relative to the folder that SIGNET is installed, in order to reach file mangement consistency. It's highly recommended to run command where signet is installed. In each of the process, you could specify the result path, and you will be asked to whether purge the tmporary files, if you already have those. It's also suggested you keep a copy of the temporary files for each analysis, in case you need them in later steps. Please run each analysis at a time under the same folder, as a latter process will overwrite the previous tmporary files.
This streamline project provide users easy linux user interface for constructing whole-genome gene regulatory networks using transciptomic data (from RNA sequence) and genotypic data.
Procedures of constructing gene regulatory networks can be split into six main steps:
- genotype preprocess
- gene expression preprocess
- adjust for covariates
- cis-eQTL analysis
- network analysis
- network visualization
To use this streamline tool, user need first to prepare the genetype data in vcf format. Then set the configuration file properly, and run each step command seperately.
We highly recommand you to prepare the gene expression data and genotype data first, and place them to a specific data folder, to organize each step as it may involve many files:
Click here for more detail about genotype and genexpression dataset
Here we set the number of autosomes to 22, so the chromosomes we study are 1-22.
signet -s --nchr 22
We can use the command to check below to check autosome number
signet -s --nchr
That is, when no value is provided, we will display the value of the specified parameter. We can also use
signet -s
to display the values of all parameters. We may also provide a way to reset the value of one parameter or all parameters to default values.
signet -s --d
or
signet -s --nchar --d
For preprocessing genotype data
signet -g
For preprocessing transcriptomic (gene expression) data
signet -t
For cis-eQTL analysis.
signet -c
For network construction.
signet -n
For network visualization.
signet -v
Please note that you have to run genotype preprocessing before gene expression preprocessing if you are using the GTEx cohort
signet -s
command is used for look up and modify parameter in the configuration file config.ini. You don't have to modify the parameters at the very beginning, as you will have options to change your input parameters in each step.
click here for detailed introduction for configuration file.
signet -s [--PARAM] [PARAM VAL]
--PARAM list the value of parameter PARAM
--PARAM [PARAM VAL] modify the value of parameter PARAM to be [PARAM VAL]
# list all the parameters
signet -s
## echo: all the current parameters
# List the paramter
signet -s --nchr
## echo: 22
# Replace s with settings would also work
signet -settings --nchr
# Modify the paramter
signet -s --nchr 22
## echo: Modification applied to nchr
# Set all the parameters to default
signet -s --d
## echo: Set all the parameters to default
# If you input wrong format such as "-nchr"
signet -s -nchr
echo: The usage and description instruction.
# If you input wrong name such as "-nchro"
echo: Please check the file name
(TCGA)
The command signet -t
will take the matrix of base-2 logarithm transformed gene count data and preprocess it. Each row represents the data for each gene, and each column represents the data for each sample, while the first row is the sample name, and the first column is the gene name. Note that the last 5 rows are not considered in the analysis since they contain ambigous gene information by UCSC.
In this step, we will filter out genes with total counts less than 2.5 million according to NIH standard and are counted in less than 20% of the samples, after which we will apply variance stablizing transformation by DESeq2 to normalize data. Furthermore, we will only focus on protein coding genes.
signet -t [--g GEXP_FILE] [--p MAP_FILE]
--g | --gexp gene expression file
--p | --pmap genecode gtf file
--restrict restrict the chromosomes of study
--r | --rest result prefix
gexp
: the base-2 logarithm transformed count data for genes as a matrix with the first column containing the ENSEMBEL ID, the first row containing sample names, the rest rows include data for genes and rest columns encode data for samples (however, the last 5 rows are not included in the following analysis since they contain ambigous gene information by UCSC);pmap
: collapsed genecode v22 gtf file, an annotation file which combines all isoforms of a gene into a single transcript. The detailed information could be found in collapsed gene model;restrict
: specifing chromosome(s) of interest, which may be dash separated, e.g. 1-22; comma separated, e.g. 1,2,3; or simply a number, e.g. 1.
Output of gexp-prep
will be saved to res/rest
.
signet_gexp
: gene expression data after pre-processing.signet_gene_name
: corresponding gene name.signet_gene_pos
: correspongding gene position.signet_gexpID
: correspdonding sample ID.
# List the paramter
signet -t --help
## Display the help page
# Modify the paramter
signet -t --g data/gexp-prep/TCGA-LUAD.htseq_counts.tsv \
--p data/gexp-prep/gencode.v22.gene.gtf \
--restrict 1
## The preprocessed gene expresion result with correpsonding position file will be stored in /res/rest/
(GTEx)
signet -t [--r READS_FILE] [--tpm TPM_FILE]
--r | --read gene reads file in gct format
--t | --tpm gene tpm file
--g | --gtf genecode gtf file
--rest result prefix
read
: GTEx reads file in gct format.tpm
: GTEx TPM file in gct format.gtf
: collapse gene code v26 gtf file.
# List the paramter
signet -t --help
## Display the help page
# Modify the paramter
signet -t --reads data/gexp/GTEx_gene_reads.gct \
--tpm data/gexp/GTEx_gene_tpm.gct \
--gtf data/gexp-prep/gencode.v26.GRCh38.genes.gtf
## The preprocessed gene expresion result with correpsonding position file will be stored in /res/rest/
(TCGA)
The command signet -g
provides the user interface of preprocessing genotype data. We first use PLINK to conduct quality control, filtering out samples and SNPs with high missing rates and filtering SNPs discordant with Hardy Weinberg equilibruim. We then use IMPUTE2 to impute missing genotypes in parallel.
signet -g [OPTION VAL] ...
--p | --ped ped file
--m | --map map file
--mind missing rate per individual cutoff
--geno missing rate per markder cutoff
--hwe Hardy-Weinberg equilibrium cutoff
--nchr chromosome number
--r | --ref reference file for imputation
--gmap genomic map file
--i | --int interval length for IMPUTE2
--ncores number of cores
--resg result prefix
ped
: includes pedgree information, i.e.,[family_ID, individual_ID, mother_ID, father_ID, gender,phenotype] in the first six columns, followed by 2p columns with two columns for each of p SNPs;map
: includes SNP location information with four columns, i.e., [chromosome, SNP_name, genetic_distance,locus] for each of p SNPs;mind
: missing rate cutoff for individuals, a value in [0, 1]. By default 0.1;geno
: missing rate cutoff for SNPs, a value in [0, 1]. By default 0.1;hwe
: Hardy-Weinberg equilibrium cutoff, a value in (0, 1]. By default 10^-4;ref
: reference file for imputation, can be downloaded from website of IMPUTE2;gmap
: genomic map file for imputation, can be downloaded from website of IMPUTE2;int
: a positive number specifying the interval length for imputation. By default 10^6;ncores
: an integer larger than 1 specifying number of cores in the current server. By default 20.
# List the paramter
signet -g --help
## Display the help page
# Modify the paramter
signet -g --ped data/geno-prep/test.ped \
--map data/geno-prep/test.map \
--ref data/ref_panel_38/chr \
--gmap data/gmap/chr
Two files will be generated from preprocessing the genoytpe data (The filename begins with signet
by default, you are able to customize it by setting an additional flag --resg
. e.g. --resg res/resg/[younameit]
):
signet_Geno
: Genotype data with each row denoting the SNP data for one subject;signet_Genotype.sampleID
: Sample ID for each individual, which uses the reading barcode.
(GTEx)
signet -g
command provide the user the interface of preprocessing genotype data. We will first extract the genotype data that has corresponding samples from gene expression data for a particular tissue.
Output of geno-prep
will be saved under /res/resg
:
signet -g [OPTION VAL] ...
--vcf0 set the VCF file for genotype data before phasing
--vcf set the VCF file for genotype data, the genotype data is from GTEx after phasing using SHAPEIT
--read set the read file for gene expression read count data in gct format
--anno set the annotation file that contains the sample information
--tissue set the tissue type
vcf
: includes SNP data from GTEx v8 before phasing in vcf formatvcf0
: includes SNP data from GTEx v8 after phasing in vcf formatread
: gene count data in tpm formatanno
: GTEx v8 annotation filetissue
: tissue type, lower/upper case must exactly map to what is included in the annotation file
# Set the cohort
signet -s --cohort GTEx
# Modify the paramter
signet -g --vcf0 data/geno-prep/Geno_GTEx.vcf \
--vcf data/genotype_after_phasing/Geno_GTEx.vcf \
--read data/gexp/GTEx_gene_reads.gct \
--anno data/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt \
--tissue Lung
The command signet -a
provides users the interface to match genotype and gene expression files, calculate principal components (PCs) for population stratification, adjust for covariates effect by top PCs, races and gender. Then calculate the minor allele frequency (MAF).
Note that signet -a
reads the output from signet -g
and signet -t
.
output of adj
will be saved under /res/resa
:
(TCGA)
c
: clinical file from TCGA project. Should contain at least a column of submitter id.
signet -a [--c CLINIVAL_FILE]
--c | clinical clinical file for your cohort
--resa result prefix
c
: specifying a TSV file including clinical information, with at least columns of submitter id, gender and race in TCGA data.
signet -a --c ./data/clinical.tsv
Output of adj
will be saved to res/resa
:
signet_geno.data
: matched genotype data, with rows representing samples and columns representing SNPs.signet_gexp.data
: matched gene expression datat o be used further for network analysis, adjusted for covariates but don't include PCs, with rows representing samples and columns representing gene expressions.signet_gexp_rmpc.data
: matched gene expression data to be used further for cis-eQTL analysis, adjusted for covariates including PCs, with rows representing samples and columns representing gene expressions.signet_matched.gexp
: matched gene expression data, without ajusting for covariates, with rows representing samples and columns representing gene expressions.signet_new.Geno.maf
: MAF file for genotype data.signet_new.Geno.map
: MAP file for genotype data.
(GTEx)
signet -a [--p PHENOTYPE_FILE]
--pheno GTEx phenotype file
--resa result prefix
-pheno
: phenotype file from the GTEx v8.
signet -a --pheno \
./data/pheno.txt
Now that we have completed all necessary preprocessing, normalization, and data cleaning, we are ready to perform cis-eQTL mapping. If you want to construct GRN with your own data (rather than TCGA or GTEx data), you should preprocess your data by yourself (instead of above functions provided by SIGNET) and then use SIGNET from this step.
Before you start this step, please make sure that you have the following files ready:
- Gene expression file including preprocessed gene expressions matched with preprocessed genotype data, adjusted for all covariates including top PCs (for
--gexp
); - Gene expression file including preprocessed gene expressions matched with preprocessed genotype data, adjusted for all covariates other than top PCs (for
--gexp.withpc
); - Genotype file including SNP minor allele count data as a matrix of values 0, 1, 2, with each row encoding for one subject and each column encoding for one SNP (for
--geno
); - Map file including SNP positions as a matrix in .map file format;
- MAF file including SNP minor allele frequency data as one column where each value is the number of SNPs after preprocessing (for
--maf
); - Gene position information file with the first column specifying the gene name, second column specifying the chromosome index (e.g. "chr1"), the third and fourth columns specifying the start and the end positions of the gene, respectively (for
--gene_pos
).
Caution: Genes in the two gene expression files are arranged according to the order of genes in the gene position information file.
For each gene, we will use an adaptive rank sum permutaion test to identify its cis-eQTL as instrumental variables. Therefore, the possible instrumental variables of a specific gene include any genetic variants within its coding region as well as upstream and downstream regions up to certain ranges which will be specified by options --upstream
and --downstream
, respectivly.
signet -c [OPTION VAL] ...
--gexp file of gene expressions adjusted for all covariates, matched with genotype data
--gexp.withpc file of gene expressions adjusted for all covariates other than top PCs, matched with genotype data
--geno file of genotype data matched with gene expression data
--map snps map file path
--maf snps maf file path
--gene_pos gene position file
--alpha | -a significance level for cis-eQTL
--nperms number of permutations
--upstream number of base pairs upstream the genetic region
--downstream number of base pairs downstream the genetic region
--resc result prefix
--help | -h user guide
gexp
: specifying the gene expression file including preprocessed gene expressions matched with preprocessed genotype data, adjusted for all covariates including top PCs;gexp.withpc
: specifying the gene expression file including preprocessed gene expressions matched with preprocessed genotype data, adjusted for all covariates other than top PCs;map
: specify the file including SNP positions as a matrix in .map file format;maf
: specify the MAF file inlcuding SNP minor allele frequency data as one column where each value is the number of SNPs after preprocessing;geno
: specifying the genotype file including SNP minor allele count data as a matrix of values 0, 1, 2, with each row encoding for one subject and each column encoding for one SNP;gene_pos
: specifying the gene position information file with the first column specifying the gene name, second column specifying the chromosome index (e.g. "chr1"), the third and fourth columns specifying the start and the end positions of the gene, respectively (for--gene_pos
).alpha
: specifying a value in (0, 1) as the significance level for identifying instrumental variables. By default 0.05;nperms
: specifying the number of permutations. By default 100;upstream
: specifying the number of base pairs upstream each genetic region. By default 1000;downstream
: specifying the number of base pairs downstream each genetic region. By default 1000.
Output of cie-eQTL
will be saved to res/resc
:
signet_net.Gexp.data
: is the expression data for gene expression, wo removing the PC by default.signet_net.genepos
: include the position for genes insignet_net.Gexp.data
, has four columns: gene name, chromosome number, start and end position, respectively.signet_cis.name
: genes with cis-eQTLs.signet_[common|low|rare|all].eQTL.data
: includes the genotype data for marginally significant [ common | low | rare | all ] cis-eQTL;signet_[common|low|rare|all].sig.pValue_alpa
: includes the p-value of each pair of gene and its marginally significant [ common | low | rare | all ] cis-eQTL, where Column 1 is Gene Index, Column is SNP Index incommon.eQTL.data
, and Column 3 is p-Value.signet_[common|low|rare|all].sig.weight_alpha
: includes the weight of collapsed SNPs for marginally significant cis-eQTL. The first column is the gene index, the second column is the SNP index, the third column is the index of collapsed SNP group, and the fourth column is the weight of each SNP in its collapsed group (with value 1 or -1).
signet -c --upstream 1000 \
--downstream 1000 \
--nperms 100 \
--alpha 0.05
The command signet -n
provides the tools for constructing a GRN using the two-stage penalized least squares (2SPLS) approach proposed by Chen et al. (2018). Note that the same set of data will be bootstrapped nboots
times and each bootstrapping data set will be used to construct a GRN. The frequencies of the regulations appeared in the nboots
GRNs will be used to evaluate the robustness of constructed GRN with higher frequency implying more robust regulation.
network
receive the input from the previous step, or it could be the output data from your own pipeline:
Caution Please make sure that you are using the SLURM system. Please also don't run this step inside a container, as the singularity container is integrated as part of the procedure. as it is integrated in the analysis.
signet -n [OPTION VAL] ...
--net.gexp.data gene expression data for GRN construction
--net.geno.data marker data for GRN construction
--sig.pair significant index pairs for gene expression and markers
--net.genename gene name files for gene expression data
--net.genepos gene position files for gene expression data
--ncis maximum number of biomarkers for each gene
--cor maximum correlation between biomarkers
--nboots number of bootstraps datasets
--memory memory in each node in GB
--queue queue name
--ncores number of cores to use for each node
--walltime maximum walltime of the server in seconds
--interactive T, F for interactive job scheduling or not
--resn result prefix
--sif singularity container
--email send notification emails after each stage is compeleted if you have mail installed in Linux, and interactive=F
net.gexp.data
: output fromsignet -c
, including the expression data of genes with cis-eQTL, a n*p matrix with each row encoding the gene expression data for one subject;net.geno.data
: output fromsignet -c
, including the genotype data of significant cis-eQTL, a n*p matrix with each row encoding the genotype data for one subject;sig.pair
: output fromsignet -c
, including the p-value of each pair of gene and its significant cis-eQTL with Column 1 specifying Gene Index (innet.Gexp.data
), Column specifying SNP Index (inall.Geno.data
), and Column 3 specifying p-value;net.genename
: gene name in a p*1 vector;net.genepos
: gene position as a p*4 matrix, with first column including gene names, second column including chromosome index (e.g, "chr1"), third and fourth columns including the start and end position of genes in the chromosome, respectly;ncis
: an integer as the maximum number of biomarkers associated with each gene. By default 3;cor
: a value in [-1, 1] specifying the maximum correlation between biomarkers. By default 0.6;nboots
: an integer as the number of bootstrapping data sets taken in calculation. By deault 100;queue
: a string for queue name in the cluster;ncores
: number of cores for each node. By default 128;memory
: memory of each node, in GB. By default 256;walltime
: maximum wall time for cluster. By default 4:00:00;sif
: a Singularity container, in .sif format;email
: the email address from which you can receive notification, with default valueF
meaning no notification and is only enabled where interactive=F.
signet_Afreq
: Ajacency matrix for final list of genes. A[i, j]=1 if gene i is regulated by gene j. 0 entry indicates no regulation.signet_CoeffMat0
: Coefficient matrix of estimated regulatory effect on the original data set.signet_net.genepos
: Corresponding gene name, followed by chromsome location, start and end position.
signet -n --nboots 100 \
--queue standby \
--walltime 4:00:00 \
--memory 256
signet -v
provide tools to visualize our constructed gene regulatory networks. Users can choose the bootstrap frequency threshold and number of subnetworks to visualize the network.
You should first ssh -Y $(hostname) to a server with DISPLAY if you would like to use the singularity container, and the result can be viewed through a pop up firefox web browser
signet -v [OPTION VAL] ...
--Afreq matrix of regulation frequencies from bootstrap results
--freq bootstrap frequecy for the visualization
--ntop number of top sub-networks to visualize
--coef coefficient of estimation for the original dataset
--vis.genepos gene position file
--id NCBI taxonomy id, e.g. 9606 for Homo Sapiens, 10090 for Mus musculus
--assembly genome assembly, e.g. hg38 for Homo Sapiens, mm10 for Mus musculus
--tf transcirption factor file, only sepecified for non-human data
--resv result prefix
--help usage
Afreq
: specifying the estimated bootstrap frequencies for regulations in a matrix with (row, column)-th entry encoding the frequency of row gene regulated by column gene;freq
: specifying the bootstrap frequency cutoff as a value in [0, 1]. By default 1;ntop
: specifying the number of top subnetworks to visualize. By default 1;coef
: specifying the file including the estimation of coefficients from the original data, a matrix with its (row,column) value for the row gene regulated by column gene;vis.genepos
: specifying the file including the position of genes to be visualized, a matrix with the first column including the gene name, second column including its chromosome index (e.g. "chr1"), the thrid and fourth column including its start and end positions in the chromosome, respectively;id
: specifying NCBI taxonomy id number, e.g,9606
for Homo Sapiens. By default9606
;assembly
: specifying Genome assembly. e.g,hg38
for Homo Sapiens. By defaulthg38
;tf
: specifying a file with the names of genes that are transcription factors, only needed for the study which is not for Homo Sapiens.
signet_edgelist*
: Edgelist file includes infromation for all regulation for given cutoff. Includes gene symbol, chromosme number, start and end posistion for both source and target gene, followed by bootstrap frequency and coefficient estimated from the original data.signet_top*.html
: HTML file for largest sub-networks visualization.signet_top*.name.txt
: Gene name list fo largest sub-networks, given bootstrap cutoff.
signet -v
config.ini file is under the main folder and saving the costomized parameters for all of the stages of signet process. Settings in config.ini are orgnized by different sections.
Users can change the SIGNET process by modifying the paramter settings in the configuration file.
# script folder save all the code
- script/
- gexp_prep
- geno_prep
- adj
- cis-eQTL
- network
- netvis