Last updated: 2017/01/08 21:42
http://weizhongli-lab.org/RNA-seq
Developed by Weizhong Li's lab at JCVI (http://weizhongli-lab.org). Contact: wli@jcvi.org, liwz@sdsc.edu
The RNA-seq portal provides integrated computational tools and workflows for RNA-seq based gene expression analysis of agriculturally important animal species. To help researchers with data analysis, the portal is designed to let users run end-to-end computational workflows for multiple samples with minimal effort. The project is developed with support from NIFA (award #2013-67015-21428). To contact us, please visit our RNA-seq project page, where you can send us comments using the web form.
You can also start with this video clip on YouTube (~ 15 minutes).
The web portal for running RNA-seq analysis (see figure below) is implemented using the Galaxy framework. Users new to Galaxy are encouraged to learn Galaxy's concepts and basic usage; please visit Galaxy's wiki page and documentation for more information.
You don't need to be a Galaxy expert to use this portal. If you are completely new to Galaxy, you may find that there is too much documentation from the Galaxy project to read; watching a few screencasts is enough, and you shouldn't need to spend more than 30 minutes.
In order to use this portal, please register first. Registration is very simple: hit the registration link under the User tab or go to http://weizhongli-lab.org:8088/user/create and follow the instructions. Once you have registered, you will be able to access your data and analysis results from the Galaxy history panel, which appears on the right side of the browser.
After registration, please log in to the portal. Some RNA-seq sequences are shared with all users as examples. From the top menu, click “Shared Data”, then “Data Libraries”; a new page will appear. Click “fruitfly-RNA-seq-example” to open another page. Check all the datasets by clicking the checkbox next to the “Name” button; below these datasets, select “Import to current history” (the default) and click the “Go” button on the right. The datasets will be loaded into your history (see figure below).
These datasets are from the paper by Trapnell et al., Nat Protoc. (2012) 7(3): 562–578. They are RNA-seq reads from fruit fly under two conditions (C1 and C2). There are 3 samples (R1, R2, R3) in each group, 6 samples in total. Since these are paired-end reads, there are 12 FASTQ files in total. The samples were down-sized to 2.5 million reads per sample.
Next, let's run these samples using the tophat PE reads workflow (see figure below).
Depending on the size of your job and the number of jobs in the queue, it may take hours or longer to finish. When the job is completed, the history items will turn green. Click the “eye” icon; the page will contain a link to the results (see figure below).
The workflow generates a lot of output files, such as alignment .bam files, gene model .bed files, gene expression data, assemblies, log files, etc. All these files are available for download. You can download a compressed package containing all the files to your computer and analyze them locally, or you can browse and access the individual files on our RNA-seq server. The figure shows a sample job download page.
If you want to analyze the job results locally, you will also need to download the reference genomes and gene models from the RNA-seq portal.
The first step for your analysis is to get the data uploaded to the portal. All the workflows in this portal accept sequences in FASTQ format, either paired end (PE) reads or single end (SE) reads. For PE reads, you need to have two reads R1 and R2 in separate FASTQ files.
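Each FASTQ record spans four lines (header, sequence, “+”, quality string), and the R1 and R2 files of a PE sample must list the mates in the same order. A quick sanity check on a hypothetical pair of PE files (file names are just examples):
$ head -4 tissue1-replicate1-R1.fq                               # inspect the first record
$ wc -l tissue1-replicate1-R1.fq tissue1-replicate1-R2.fq        # the two line counts should match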
To upload your data, please first log in to the web portal, click “Get Data”, then the “Upload File from your computer” link in the left column; a new window will appear where you can drag and drop FASTQ files (see figure below).
You can also upload your data to GenomeSpace and send it to Galaxy from there. To do this, you need to register a GenomeSpace account first and upload your data into it. Then, in the portal, click “Get Data” and then “GenomeSpace import from file browser”; a GenomeSpace window will appear, where you can select the data you want and click “Send to Galaxy”. You will be returned to the Galaxy server, and it may take a while for the data to transfer.
Most importantly, you can also upload large files (>2.0 GB) via our FTP server. If you are comfortable with Linux, you can access it from a terminal like this:
$ ftp 52.11.154.125            # provide your Galaxy email and password as Name and Password
ftp> put PATH/TO/FastqFile uploaded-1.fastq
If you are not comfortable with the command line, you can instead use FileZilla and drag whatever you want onto the FTP server. To do this, first download FileZilla (it is free). Once it is installed on your computer, edit the Site Manager to save our FTP server as one of your sites. Please make sure to set “Timeout in seconds” to 120 in Connection settings, and select “Only use plain FTP (insecure)” for Encryption and “Active” for Transfer mode.
Then click the “Connect” button and provide your password. You can now log in to the FTP server and drag files from the local site to the remote site.
The transfer can be quite time-consuming if your files are very large. After the FTP transfer has finished, log in to our Galaxy server; the uploaded files appear under “Choose FTP file”, where you can select the files to submit to Galaxy.
After a file is submitted to Galaxy, it is removed from the FTP server.
Once the upload is completed, you will see the datasets in your history in the right column. Please hit the “Edit attributes” icon to add or modify the dataset information. It is highly recommended to give your data simple and meaningful names (e.g. “tissue1-replicate1-R1.fq”, “control-replicate-R1.fq”, etc.) and to avoid special characters.
Sometimes you may need to combine the reads from different lanes for each sample. You can use the cat command in Linux, or you can click “Text Manipulation” and then “Concatenate datasets tail-to-head” to combine them, as in the example below.
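For example, to merge two lanes of a PE sample on the command line (the file names below are hypothetical), concatenate the R1 files and the R2 files separately, keeping the same lane order in both:
cat tissue1-rep1_L001_R1.fastq tissue1-rep1_L002_R1.fastq > tissue1-rep1-R1.fastq
cat tissue1-rep1_L001_R2.fastq tissue1-rep1_L002_R2.fastq > tissue1-rep1-R2.fastq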
The portal currently offers three end-to-end workflows. All of them are implemented to handle multiple samples and multiple groups. A workflow performs the identical process (e.g. mapping) for each individual sample, compares the results between groups or samples, and may analyze data based on pooled samples or groups.
The Tophat, Cufflinks and Cuffdiff workflow, also known as the Tuxedo package, is one of the most widely used approaches in RNA-seq data analysis. The workflow we implemented here is based on the pipeline described in the paper by Trapnell et al., Nat Protoc. (2012) 7(3): 562–578. The pipeline is shown in the figure below; please refer to this paper for more information.
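For orientation, the sketch below shows roughly what this kind of pipeline runs behind the scenes for one PE sample and a two-group comparison, following the commands described in the Nature Protocols paper. File names, index paths and the list in assemblies.txt are hypothetical, and the portal's exact parameters may differ.
tophat2 -o tophat_C1_R1 genome_index C1_R1_1.fq C1_R1_2.fq            # map one sample to the genome
cufflinks -o cufflinks_C1_R1 tophat_C1_R1/accepted_hits.bam           # assemble its transcripts
cuffmerge -g genes.gtf -s genome.fa assemblies.txt                    # merge the per-sample transcripts.gtf files
cuffdiff -o cuffdiff_C1_vs_C2 merged_asm/merged.gtf \
    tophat_C1_R1/accepted_hits.bam,tophat_C1_R2/accepted_hits.bam \
    tophat_C2_R1/accepted_hits.bam,tophat_C2_R2/accepted_hits.bam     # differential expression between the two groups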
For each individual sample, the workflow runs the following process:
For all the samples together, the workflow runs
For any pair of two groups, the workflow runs
The workflow generates a lot of files, including
An example of the Tophat, Cufflinks and Cuffdiff workflow is available from this link.
This workflow uses STAR, an ultrafast RNA-seq aligner, to map reads to the reference genome.
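As a rough illustration (the index path, read file names and options are hypothetical; the workflow sets its own parameters), a typical STAR alignment for one PE sample looks like this. The GeneCounts mode is one common way to produce a per-gene read count table for edgeR-style analysis.
STAR --runThreadN 8 --genomeDir refs/species/STAR-index \
     --readFilesIn sample-R1.fastq sample-R2.fastq \
     --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \
     --outFileNamePrefix sample.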
For each sample, the workflow runs the following process:
For any pair of sample groups, the workflow runs differential analysis with edgeR.
The output from the workflow includes:
An example of the STAR workflow is available from this link.
This workflow is implemented according to the Trinity paper in Nature Protocols, 8 (8), pp. 1494–1512; please refer to this paper. Additional information about the protocol is available at http://trinityrnaseq.github.io.
This workflow runs Trinity for multiple samples from multiple groups. It first assembles the reads pooled from all input samples into a transcriptome, then aligns the reads from each individual sample to the assembled transcriptome with bowtie and performs post-assembly transcriptome analysis using RSEM.
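As an illustration of these two steps (the file names are hypothetical, and the portal's parameters and your Trinity version may differ), the pooled assembly and one per-sample RSEM run look roughly like this; the second command is a utility script shipped with Trinity.
Trinity --seqType fq --max_memory 50G --CPU 8 \
    --left C1_R1_1.fq,C1_R2_1.fq,C2_R1_1.fq --right C1_R1_2.fq,C1_R2_2.fq,C2_R1_2.fq \
    --output trinity_out                                              # assemble reads pooled from all samples
align_and_estimate_abundance.pl --transcripts trinity_out/Trinity.fasta --seqType fq \
    --left C1_R1_1.fq --right C1_R1_2.fq --est_method RSEM --aln_method bowtie \
    --prep_reference --output_dir rsem_C1_R1                          # per-sample abundance with bowtie + RSEM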
The workflow generates a lot of files, including
An example of the Trinity workflow is available from this link.
The reference genomes, gene annotations and formatted genomes are available from the RNA-seq portal. You will need these data for further analysis of your workflow results. You can also download the reference genomes from our FTP server as an anonymous user, using the terminal or FileZilla as described above.
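For example, from a terminal (log in with user name “anonymous”; the file name below is a placeholder for the species you need):
$ ftp weizhongli-lab.org
ftp> cd pub/reference-genomes
ftp> ls
ftp> get <reference-genome-file>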
The Integrative Genomics Viewer (IGV) is a very popular tool for interactive exploration of large, integrated genomic datasets, and it can be used to visualize the results generated by our workflows. To make this easier, we have pre-formatted the genomes and annotations in IGV format for all species in our portal. To allow IGV to access the genomes on our server, please configure IGV to set the genome server URL to http://weizhongli-lab.org/RNA-seq/Data/reference-genomes.txt (see figure below). Please also enable port 60151.
After this configuration, you will be able to select the reference genome within IGV.
In our workflows, all BAM files are sorted, so you can load them directly into IGV. You can either download the files locally or load them directly from our server (use File → Load from URL within IGV). You can visualize not only BAM files but also other types of files (e.g. BED files) generated by our workflows.
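If IGV is already running with its remote-control port 60151 enabled (see the configuration above), you can also push a file into it from the command line; the BAM URL below is only a placeholder for a file from your own job results:
curl "http://localhost:60151/load?file=http://weizhongli-lab.org/RNA-seq/path/to/sample.bam"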
To be added.
To support users who prefer to run these workflows locally, or who want to set up the web portal on their own servers with the flexibility of using different parameters, our back-end software package is freely available.
The software package needs to be installed on a generic Linux computer cluster that supports Open Grid Engine (OGE). Here is a list of requirements:
The system is built on many software tools, so it is important that developers who maintain this software package have experience with the following computer languages and systems:
The software package is available from http://weizhongli-lab.org/software/RNA-seq-portal-release.V0.0.1.tar.gz and ftp://weizhongli-lab.org/pub/software/. Please download and unpack the tar ball (e.g. RNA-seq-portal-release.V0.0.1.tar.gz) to a folder (e.g. /home/oasis/RNA-seq-home; let's call it $RNA_SEQ_ROOT_DIR) on a large shared file system if you already have a Linux cluster. Otherwise, unpack it to a temporary working directory and move it to the large file system later. Please then follow the next steps.
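For example (the install location below is just the example path used above; adjust it to your own file system):
wget http://weizhongli-lab.org/software/RNA-seq-portal-release.V0.0.1.tar.gz
mkdir -p /home/oasis/RNA-seq-home
tar -xzf RNA-seq-portal-release.V0.0.1.tar.gz -C /home/oasis/RNA-seq-home
export RNA_SEQ_ROOT_DIR=/home/oasis/RNA-seq-home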
Option 1 is to ask your IT department to set up a Linux cluster that supports OGE and the other requirements listed above.
Option 2 is to utilize Amazon cloud or other cloud providers e.g. Google and Microsoft. We have used Amazon to support our web portal and we share our experiences here:
[global]
DEFAULT_TEMPLATE=RNAseqCluster
ENABLE_EXPERIMENTAL=True

[aws info]
AWS_ACCESS_KEY_ID =
AWS_SECRET_ACCESS_KEY =
AWS_USER_ID =
## e.g. AWS_USER_ID=999999999999
## or e.g. AWS_USER_ID=arn:aws:iam::999999999999:user/username
AWS_REGION_NAME = us-west-2

[key rnaseq]
KEY_LOCATION=
## e.g. KEY_LOCATION=~/.ssh/myotherkey.rsa

[cluster RNAseqCluster]
KEYNAME = rnaseq
CLUSTER_SIZE = 1
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
NODE_IMAGE_ID =
NODE_INSTANCE_TYPE = r3.2xlarge
DISABLE_QUEUE=True
MASTER_INSTANCE_TYPE = r3.large
VOLUMES = oasis, scratch1
PLUGINS = createusers, sge
PERMISSIONS = http, https, ftp, vsftpd, galaxy, galaxy1

## here we have configured two EBS volumes to be used by the cluster
[volume oasis]
VOLUME_ID =
MOUNT_PATH = /home/oasis

[volume scratch1]
VOLUME_ID =
MOUNT_PATH = /home/scratch1

# open port 80 on the cluster to the world
[permission http]
FROM_PORT = 80
TO_PORT = 80

[permission https]
FROM_PORT = 443
TO_PORT = 443

[permission ftp]
FROM_PORT = 20
TO_PORT = 21

[permission vsftpd]
FROM_PORT = 13000
TO_PORT = 13100

[permission galaxy]
FROM_PORT = 8080
TO_PORT = 8080

[permission galaxy1]
FROM_PORT = 8088
TO_PORT = 8088

#### here, we create two cluster users
[plugin createusers]
SETUP_CLASS = starcluster.plugins.users.CreateUsers
usernames = rnaseq, guest

[plugin sge]
SETUP_CLASS = starcluster.plugins.sge.SGEPlugin
MASTER_IS_EXEC_HOST = False
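With a configuration like the one above saved as ~/.starcluster/config, the cluster can be launched from the StarCluster command line; the cluster name used below is arbitrary:
pip install StarCluster
starcluster start -c RNAseqCluster rnaseqcluster    # launch a cluster from the RNAseqCluster template
starcluster sshmaster rnaseqcluster                 # log in to the master node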
Next, install the Galaxy server on your computer, preferably on the master node of the cluster. Please try these steps:
config/galaxy.ini
config/tool_conf.xml
config/tool_data_table_conf.xml
static/welcome.html
tool-data/bowtie2_indices.loc
tool-data/README
tool-data/tophat2_indices.loc
tools/myrnaseqwf/README
tools/myrnaseqwf/star-pe.xml
tools/myrnaseqwf/star-se.xml
tools/myrnaseqwf/tophat-mapping-pe.xml
tools/myrnaseqwf/tophat-mapping-se.xml
tools/myrnaseqwf/trinity-pe.xml
tools/myrnaseqwf/trinity-se.xml
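One way to do this is a sketch only: clone Galaxy, copy the customized files listed above into the matching paths of the Galaxy tree, and start the server. The Galaxy release branch and the source location of the customized files depend on your installation; the cp source path below is an assumption.
git clone https://github.com/galaxyproject/galaxy.git
cd galaxy
cp $RNA_SEQ_ROOT_DIR/galaxy/config/galaxy.ini config/galaxy.ini    # repeat for each file listed above
sh run.sh                                                          # starts Galaxy, by default on port 8080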
In the $RNA_SEQ_ROOT_DIR/web directory, there is a job.php. You need to configure a web server with PHP enabled (ask your IT department) so that you can access http://your_URL/job.php. Try http://your_URL/job.php?20160711094138113678029672. Here, 20160711094138113678029672 is a sample results directory copied into $RNA_SEQ_ROOT_DIR/web/user-data. Some large files were deleted to reduce the size of the software package, so some broken links are expected on the page http://your_URL/job.php?20160711094138113678029672.
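As a minimal sketch, assuming an Apache web server with mod_php on an Ubuntu-like system (package names, web root and URL mapping will differ between setups), exposing the web directory could look like this:
sudo apt-get install apache2 php libapache2-mod-php
sudo ln -s $RNA_SEQ_ROOT_DIR/web /var/www/html/RNA-seq
# job.php would then be reachable at http://your_URL/RNA-seq/job.php; adjust the web root mapping to match your URL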
Reference genomes can be downloaded from ftp://weizhongli-lab.org/pub/reference-genomes/. You should put the genome files under $RNA_SEQ_ROOT_DIR/refs.
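For example, to mirror the whole reference-genome directory into place with wget:
cd $RNA_SEQ_ROOT_DIR/refs
wget -r -np -nH --cut-dirs=2 ftp://weizhongli-lab.org/pub/reference-genomes/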
Go to $RNA_SEQ_ROOT_DIR/NGS-tools and edit the following files:
#### go to line #7 and replace /home/oasis/data/NGS-ann-project with your $RNA_SEQ_ROOT_DIR
$NGS_root = "/home/oasis/data/NGS-ann-project";
#### go to line #31
#### replace /home/oasis/data/NGS-ann-project/apps/bin: with $RNA_SEQ_ROOT_DIR/apps/bin:
$ENV{"PATH"} = "/home/oasis/data/NGS-ann-project/apps/bin:" . $ENV{"PATH"};

#### replace /home/oasis/data/RNA-seq-dir/web/user-data with $RNA_SEQ_ROOT_DIR/web/user-data
my $www_dir = "/home/oasis/data/RNA-seq-dir/web/user-data";

#### replace http://weizhongli-lab.org/RNA-seq/Data/job.php with http://your_URL/job.php
#### your_URL is the URL you just set up
my $www_web_url = "http://weizhongli-lab.org/RNA-seq/Data/job.php";
Restart the Galaxy server; if everything is configured correctly, the workflows should now be available from the portal.
If you find the RNA-seq portal helpful to your research and study, please kindly cite this reference:
Please also cite the original methods integrated into our workflows: