RNA-seq web portal User's Guide

Last updated: 2017/01/08 21:42

http://weizhongli-lab.org/RNA-seq

Developed by Weizhong Li's lab at JCVI (http://weizhongli-lab.org). Contact: wli@jcvi.org, liwz@sdsc.edu

Introduction

The RNA-seq portal provides integrated computational tools and workflows for RNA-seq based gene expression analysis in agriculturally important animal species. To help researchers with data analysis, the portal lets you run end-to-end computational workflows on multiple samples with minimal effort. The project is developed with support from NIFA (award #2013-67015-21428). To contact us, please visit our RNA-seq project page, where you can send us comments using the web form.

Get started with the portal

You can start with this video clip on YouTube (about 15 minutes).

Learn Galaxy framework

The web portal for running RNA-seq analysis (see figure below) is implemented with the Galaxy framework. Users new to Galaxy should first learn Galaxy's concepts and basic usage; please visit Galaxy's Wiki page and the Galaxy wiki documentation for more information.

You don't need to be a Galaxy expert to use this portal. If you are completely new to Galaxy, you may find that the Galaxy project has too much documentation to read. Watching a few screencasts is enough, and it should not take more than 30 minutes.

Registration

To use this portal, please register first. Registration is simple: click registration under the User tab, or go to http://weizhongli-lab.org:8088/user/create, and follow the instructions. Once you have registered, you will be able to access your data and analysis results from the Galaxy history panel, which appears on the right side of the browser.

Use the provided examples

After registration, please log in to the portal. Some RNA-seq sequences are shared with all users as examples. From the top menu, click “shared data”, then click “Data libraries”; a new page will show. Click “fruitfly-RNA-seq-example” and another page will load. Select all the datasets by ticking the checkbox next to the “name” button. Below the datasets, select “Import to current history” (the default) and click the “Go” button on the right. The datasets will be loaded into your history (see figure below).

These datasets are from the paper by Trapnell et al., Nat Protoc. (2012) 7(3): 562–578. They are RNA-seq reads from fruit fly under two conditions (C1 and C2). There are 3 samples (R1, R2, R3) in each group, 6 samples in total. Since these are paired-end reads, there are 12 FASTQ files in total. The samples were downsized to 2.5 million reads per sample.

Run a tophat workflow

Next, let's run these samples using the tophat PE reads workflow (see figure below).

  • Click “RNA-seq workflows” in the left panel.
  • Click “tophat PE reads”; a new page will load.
  • Select Drosophila_melanogaster under “Please select a reference genome”.
  • Select a group (you can analyze up to 4 groups of samples with this workflow).
  • Enter a unique sample name.
  • From the pull-down menus, select “smGSM794483_C1_R1_1.fq” for Fastq file R1 and “smGSM794483_C1_R1_2.fq” for Fastq file R2.
  • Click the “add new FASTQ file for sample” button to add the next sample.
  • Repeat to add the remaining samples.
  • Click the execute button.
  • When the job has finished, it will appear in your history panel on the right.
  • Click the “eye” icon to view and download the results.
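Before filling in the form, it can help to check that every sample has both mate files. The following is only a sketch of that pairing in Python. It assumes all file names follow the pattern of the single example name above (smGSM794483_C1_R1_1.fq: a prefix, the group, the replicate, then the mate number); the function name is hypothetical and not part of the portal.

```python
import re
from collections import defaultdict

# Assumed pattern, inferred from the one example name
# "smGSM794483_C1_R1_1.fq": <prefix>_<group>_<replicate>_<mate>.fq
NAME_RE = re.compile(
    r"^(?P<prefix>.+)_(?P<group>C\d+)_(?P<rep>R\d+)_(?P<mate>[12])\.fq$"
)


def pair_fastq_files(names):
    """Group FASTQ names into {group: {replicate: (mate1_file, mate2_file)}}."""
    mates = defaultdict(dict)
    for name in names:
        match = NAME_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        mates[(match["group"], match["rep"])][match["mate"]] = name

    samples = defaultdict(dict)
    for (group, rep), found in sorted(mates.items()):
        # A paired-end sample needs exactly one R1 file and one R2 file.
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {group}_{rep} is missing a mate file")
        samples[group][rep] = (found["1"], found["2"])
    return {group: dict(reps) for group, reps in samples.items()}
```

For the full example library, the result would hold two groups with three replicates each, which corresponds to the 12 FASTQ files described above.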

View and download the results

Depending on the size of your job and the number of jobs in the queue, it may take hours or longer to finish. When the job is completed, its history entry turns green. Click the “eye” icon; the page that opens contains a link to the results (see figure below).

The workflow generates many output files, such as alignment .bam files, gene model .bed files, gene expression data, assemblies and log files. All of these files are available for download. You can download a compressed package containing all the files and analyze them locally, or you can browse and access the individual files on our RNA-seq server. The figure shows a sample job download page.

If you want to analyze the job results locally, you will also need to download the reference genomes and gene models from the RNA-seq portal.

Run your RNA-seq samples

The first step of your analysis is to upload the data to the portal. All the workflows in this portal accept sequences in FASTQ format, either paired-end (PE) or single-end (SE) reads. For PE reads, the R1 and R2 reads must be in two separate FASTQ files.

To upload your data, first log in to the web portal. Click “Get Data”, then the “Upload File from your computer” link in the left column. A new window will appear, where you can drag and drop FASTQ files (see figure below).

  • Please use gzipped FASTQ files for faster upload (uncompressed files are also accepted); gzip each FASTQ file individually.
  • Please choose fastq as the type (you can change the type later if you forget).
  • Click the start button; the upload may take a while.
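If you prefer to compress the files from a script rather than with the gzip command, the step can be sketched in Python as below. This is an illustration only; the function name is hypothetical, and the original file is left in place.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(path):
    """Compress one FASTQ file to <name>.gz next to it and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Stream the data so that large FASTQ files are never held in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, calling gzip_fastq("tissue1-replicate1-R1.fq") would write tissue1-replicate1-R1.fq.gz alongside the original. Each file is compressed on its own, as the portal expects.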

You can also upload your data to GenomeSpace and send it to Galaxy. To do this, first register a GenomeSpace account and upload your data to it. Then, in the portal, click “Get Data” and “GenomeSpace import from file browser”; a GenomeSpace window will open. Select the data you want to transfer and click “Send to Galaxy”. You will be returned to the Galaxy server, and the transfer may take a while.

Most importantly, you can upload large files (>2.0 GB) via our FTP server. If you are comfortable with Linux, you can access it from a terminal like this:

$ ftp 52.11.154.125    # enter your Galaxy email and password as Name and Password

ftp> put PATH/TO/FastqFile uploaded-1.fastq
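The same upload can be scripted. The sketch below uses Python's standard ftplib and mirrors the terminal session above; the function names are hypothetical, and switching to active mode is an assumption based on the “Active” transfer mode this guide recommends for FileZilla.

```python
from ftplib import FTP
from pathlib import Path


def open_portal_session(host, user, password, active_mode=True):
    """Log in to the FTP server with your Galaxy e-mail address and password."""
    ftp = FTP(host, timeout=120)
    ftp.login(user=user, passwd=password)
    if active_mode:
        # Assumption: active mode, matching the FileZilla settings in this guide.
        ftp.set_pasv(False)
    return ftp


def upload_fastq(ftp, local_path, remote_name=None):
    """Upload one file in binary mode over an already logged-in FTP session."""
    local = Path(local_path)
    remote = remote_name or local.name
    with open(local, "rb") as handle:
        ftp.storbinary(f"STOR {remote}", handle)
    return remote
```

A session would open the connection with open_portal_session("52.11.154.125", your_email, your_password), call upload_fastq once per file, and finish with ftp.quit().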

If you are not comfortable with the Linux command line, you can use FileZilla instead and drag files to the FTP server. FileZilla is free; download and install it, then use its Site Manager to save our FTP server as one of your sites. Please make sure to set “Timeout in seconds” to “120” in the Connection settings, select “Only use plain FTP (insecure)” for Encryption, and select “Active” for Transfer mode.

Then click the “Connect” button and provide your password. Once logged in, drag files from the local site pane to the remote site pane.

The transfer can take a long time if your files are extremely large. After the FTP transfer has finished, log in to our Galaxy server; the uploaded files are listed under “Choose FTP file”, where you can select the files to import into Galaxy.

After a file has been imported into Galaxy, it is removed from the FTP server.

Once the upload is complete, you will see the datasets in your history in the right column. Click the “Edit attributes” icon to edit a dataset's name and other attributes. It is highly recommended to give your data simple and meaningful names (e.g. “tissue1-replicate1-R1.fq”, “control-replicate-R1.fq”) and to avoid special characters.

Sometimes you may need to combine the reads from different lanes for each sample. You can use the cat command in Linux before uploading, or click “Text Manipulation” and then “Concatenate datasets tail-to-head” in the portal to combine them.
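For local concatenation without cat, a minimal Python sketch is shown below. The function name is hypothetical. It copies the lane files byte for byte in the order given, which is what cat does.

```python
import shutil


def concatenate_lanes(lane_files, merged_path):
    """Append the lane files to merged_path in the given order, byte for byte."""
    with open(merged_path, "wb") as out:
        for lane in lane_files:
            with open(lane, "rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path
```

For paired-end data, list the lanes in the same order for the R1 files and the R2 files so that the mates stay in step. Gzipped lanes can be joined the same way, since concatenated gzip files form a valid gzip file, but do not mix compressed and uncompressed inputs.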

Workflows

The portal currently offers three end-to-end workflows. All of them are implemented to handle multiple samples and multiple groups. A workflow performs an identical process (e.g. mapping) for each individual sample, compares the results between groups or samples, and may analyze data based on pooled samples or groups.

The TopHat, Cufflinks and Cuffdiff workflow, also known as the Tuxedo package, is one of the most widely used tools in RNA-seq data analysis. The workflow we implemented is based on the pipeline described in the paper by Trapnell et al., Nat Protoc. (2012) 7(3): 562–578. The pipeline is shown in the figure below. Please refer to this paper for more information.

For each individual sample, the workflow runs the following process:

  • QC: low-quality reads are removed and low-quality bases are trimmed. This is done with Trimmomatic using default parameters.
  • TopHat: high-quality reads are mapped to the selected reference genome
  • Cufflinks: assemble the transcripts

For all the samples together, the workflow runs

  • Cuffmerge: merge the transcripts from all the samples

For any pair of two groups, the workflow runs

  • Cuffdiff: compare the transcripts between two groups
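The order of steps described above (per-sample processing, one merge over all samples, then one comparison per pair of groups) can be sketched as a small planner in Python. This is only an illustration of the scheduling logic; the function name and the step labels are hypothetical, and the portal's actual workflow engine is not shown here.

```python
from itertools import combinations


def plan_tuxedo_steps(groups):
    """Return the ordered list of steps for {group_name: [sample_name, ...]}."""
    steps = []
    # Per-sample steps: QC, mapping and transcript assembly.
    for samples in groups.values():
        for sample in samples:
            steps.append(("trimmomatic", sample))
            steps.append(("tophat", sample))
            steps.append(("cufflinks", sample))
    # One merge over the assemblies of all samples.
    all_samples = tuple(s for samples in groups.values() for s in samples)
    steps.append(("cuffmerge", all_samples))
    # One comparison for every unordered pair of groups.
    for first, second in combinations(groups, 2):
        steps.append(("cuffdiff", (first, second)))
    return steps
```

With the maximum of 4 groups, the planner produces 6 pairwise comparisons.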

The workflow generates a lot of files, including

  • accepted_hits.bam, align_summary.txt, deletions.bed, insertions.bed, junctions.bed, unmapped.bam for each sample from tophat
  • transcripts.gtf, genes.fpkm_tracking, isoforms.fpkm_tracking for each sample from Cufflinks
  • merged.gtf from cuffmerge
  • gene_exp.diff, genes.count_tracking, genes.fpkm_tracking, genes.read_group_tracking and many others for two groups from cuffdiff
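As an example of working with these files locally, the sketch below lists the significant genes from a downloaded gene_exp.diff file. It assumes the usual tab-separated Cuffdiff layout with the column names status, significant, gene, log2(fold_change) and q_value; please check the header of your own file, since the layout may differ between Cuffdiff versions. The function name is hypothetical.

```python
import csv


def significant_genes(diff_path, max_q=0.05):
    """Return (gene, log2 fold change, q value) for tested, significant rows."""
    hits = []
    with open(diff_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Keep only rows that were tested successfully and flagged significant.
            if row["status"] != "OK" or row["significant"] != "yes":
                continue
            q_value = float(row["q_value"])
            if q_value <= max_q:
                hits.append((row["gene"], float(row["log2(fold_change)"]), q_value))
    # Most significant genes first.
    return sorted(hits, key=lambda hit: hit[2])
```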

An example of the TopHat, Cufflinks and Cuffdiff workflow is available from this link.

STAR mapping and post analysis workflow

This workflow uses STAR, an ultrafast RNA-seq aligner, to map reads to the reference genome.

For each sample, the workflow runs the following process:

  • QC: low-quality reads are removed and low-quality bases are trimmed. This is done with Trimmomatic using default parameters.
  • map the reads to the reference genome (STAR's 1st pass mapping)
  • run 2nd pass mapping using the junctions detected in the 1st pass, pooled from all samples
  • run RSEM

For each pair of sample groups, the workflow runs differential analysis with edgeR.

The output from the workflow includes:

  • coordinate-sorted BAM file for alignments to the reference genome
  • BAM file for alignments to the reference transcriptome
  • bedGraph file for alignments to the reference genome
  • RSEM gene results and isoform results
  • differential analysis results
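If you want to repeat the differential analysis locally with other settings, the per-sample RSEM gene results first need to be combined into one count table. The sketch below shows one way to do this in Python. It assumes the standard tab-separated RSEM genes.results layout with gene_id and expected_count columns; the function names and sample names are hypothetical.

```python
import csv


def read_expected_counts(results_path):
    """Map gene_id to expected_count for one RSEM genes.results file."""
    with open(results_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["gene_id"]: float(row["expected_count"]) for row in reader}


def build_count_matrix(sample_files):
    """Combine {sample_name: path} into (sample_names, {gene_id: [counts]})."""
    names = list(sample_files)
    per_sample = [read_expected_counts(sample_files[name]) for name in names]
    # Use the union of gene ids; a gene absent from a sample counts as 0.
    genes = sorted(set().union(*per_sample))
    matrix = {gene: [counts.get(gene, 0.0) for counts in per_sample] for gene in genes}
    return names, matrix
```

The columns of the resulting table follow the order of the samples in the input dictionary, so keep samples of the same group together when you pass the table on to a tool such as edgeR.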

An example of the STAR workflow is available from this link.

Trinity assembly and post-analysis workflow

This workflow is implemented according to the Trinity paper in Nature Protocols, 8 (8), pp. 1494–1512. Please refer to this paper. Additional information about the protocol is available at http://trinityrnaseq.github.io.

This workflow runs Trinity for multiple samples from multiple groups. It first assembles the reads pooled from all the input samples into a transcriptome. It then aligns the reads from each individual sample to the assembled transcriptome with bowtie and performs post-assembly transcriptome analysis using RSEM.

The workflow generates a lot of files, including

  • Trinity.fasta: assembled transcriptome
  • RSEM.genes.results, RSEM.isoforms.results: gene and isoform abundance calculated by RSEM
  • bowtie.bam: alignment between the sample reads and the assembled transcriptome
  • gene and transcript differential analysis data produced by the post-Trinity scripts and third-party tools (e.g. edgeR)
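A quick sanity check on a downloaded Trinity.fasta is to count the assembled transcripts and compute their N50. The sketch below is a plain Python illustration with hypothetical function names; it is not part of the workflow itself.

```python
def read_lengths(fasta_path):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Here len(read_lengths("Trinity.fasta")) gives the number of assembled transcripts, and n50 of the same list summarizes their length distribution.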

An example of the Trinity workflow is available from this link.

Download reference genomes and annotation data

The reference genomes, gene annotations and formatted genomes are available from the RNA-seq portal. You will need these data for further analysis of your workflow results. You can also download the reference genomes from our FTP server as an anonymous user, from a terminal or with FileZilla, as described above.

Using genome browser IGV to visualize your data

The Integrative Genomics Viewer (IGV) is a very popular tool for interactive exploration of large, integrated genomic datasets, and it can be used to visualize the results generated by our workflows. To make IGV easier to use, we have pre-formatted the genomes and annotations in IGV format for all the species in our portal. To let IGV access the genomes on our server, set the genome server URL in IGV to http://weizhongli-lab.org/RNA-seq/Data/reference-genomes.txt (see figure below). Please also enable port “60151”.

After this configuration, you will be able to select the reference genome within IGV.

All the BAM files from our workflows are sorted, so you can load them directly into IGV. You can either download the files and load them locally, or load them directly from our server (use File → Load from URL in IGV). Besides BAM files, you can also visualize other file types (e.g. BED files) generated by our workflows.
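With port 60151 enabled, a running IGV also accepts plain-text batch commands on that port, so loading tracks can be scripted. The sketch below is an illustration under several assumptions: IGV is running on the same computer, the batch commands genome, load and goto are available in your IGV version, and the genome id and track URL you pass in are valid for your setup. The function names are hypothetical.

```python
import socket


def igv_commands(genome_id, track_urls, locus=None):
    """Build IGV batch commands: select a genome, load tracks, optionally jump."""
    commands = [f"genome {genome_id}"]
    commands += [f"load {url}" for url in track_urls]
    if locus:
        commands.append(f"goto {locus}")
    return commands


def send_to_igv(commands, host="127.0.0.1", port=60151, timeout=60):
    """Send each command to a running IGV and collect its one-line replies."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        stream = conn.makefile("rw", newline="\n")
        for command in commands:
            stream.write(command + "\n")
            stream.flush()
            # IGV answers every command with a single line.
            replies.append(stream.readline().strip())
    return replies
```

The command list is built separately from the network call so that it can be inspected before anything is sent to IGV.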

FAQ

To be added.

Software

To support users who prefer to run these workflows locally or want to set up a web portal on their own servers, with the flexibility of using different parameters, our back-end software package is freely available.

System and software requirements

The software package needs to be installed on generic Linux computer clusters that support Open Grid Engine. Here is a list of requirements:

  • Linux computer cluster, with shared users' HOME directories
  • At least 1 TB (more than 5 TB is desirable) of shared file system as working space
  • Compute nodes with 64 GB RAM are needed; 128 or 256 GB RAM is preferable for memory-intensive jobs, e.g. Trinity
  • Open Grid Engine (OGE) installed on the cluster
  • Galaxy system installed on the master node of the cluster

The system builds on many software tools, so it is important that the developers who maintain this software package have experience with the following languages and systems:

  • Galaxy
  • Shell and Perl programming (needed to work with workflow engine)
  • Python programming (needed to better work with Galaxy)
  • AWS cloud computing (needed if you use AWS)
  • PHP/JavaScript/CSS (needed for web)
  • RNA-seq tools
  • R

The software package is available from http://weizhongli-lab.org/software/RNA-seq-portal-release.V0.0.1.tar.gz and ftp://weizhongli-lab.org/pub/software/. If you already have a Linux cluster, download the tarball (e.g. RNA-seq-portal-release.V0.0.1.tar.gz) and unpack it into a folder on a large shared file system (e.g. /home/oasis/RNA-seq-home; we refer to it as $RNA_SEQ_ROOT_DIR below). Otherwise, unpack it into a temporary working directory and move it to a large file system later. Then follow the next steps.

Prepare Linux clusters

Option 1 is to ask your IT department to set up a Linux cluster that supports OGE and meets the other requirements listed above.

Option 2 is to use the Amazon cloud or another cloud provider, e.g. Google or Microsoft. We have used Amazon to host our web portal and share our experience here:

  • Obtain an AWS account
  • Understand the basic usage of AWS: setting up key pairs, configuring security groups, starting/stopping/terminating instances, setting up EBS volumes, using S3, etc. AWS provides extensive documentation; please visit https://aws.amazon.com/ and https://aws.amazon.com/documentation/ec2/
  • We use StarCluster to manage the virtual cluster on the AWS cloud; please download StarCluster from http://star.mit.edu/cluster/.
  • Understand the basic usage of StarCluster
  • Our RNA-seq tarball includes a template config file that can be used; see $RNA_SEQ_ROOT_DIR/AWS/StarCluster-config-file-template.
  • Please modify this template file with your key pairs, security settings, EBS volumes, etc.; see below:
[global]
DEFAULT_TEMPLATE=RNAseqCluster
ENABLE_EXPERIMENTAL=True
[aws info]
AWS_ACCESS_KEY_ID =
AWS_SECRET_ACCESS_KEY =
AWS_USER_ID =
## e.g. AWS_USER_ID=999999999999
## or e.g. AWS_USER_ID=arn:aws:iam::999999999999:user/username
AWS_REGION_NAME = us-west-2

[key rnaseq]
KEY_LOCATION=
## e.g. KEY_LOCATION=~/.ssh/myotherkey.rsa

[cluster RNAseqCluster]
KEYNAME = rnaseq
CLUSTER_SIZE = 1
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
NODE_IMAGE_ID = 
NODE_INSTANCE_TYPE = r3.2xlarge
DISABLE_QUEUE=True
MASTER_INSTANCE_TYPE = r3.large
VOLUMES = oasis, scratch1
PLUGINS = createusers, sge
PERMISSIONS = http, https, ftp, vsftpd, galaxy, galaxy1

## here we configure two EBS volumes to be used by the cluster
[volume oasis]
VOLUME_ID = 
MOUNT_PATH = /home/oasis

[volume scratch1]
VOLUME_ID = 
MOUNT_PATH = /home/scratch1

# open port 80 on the cluster to the world
[permission http]
FROM_PORT = 80
TO_PORT = 80

[permission https]
FROM_PORT = 443
TO_PORT = 443

[permission ftp]
FROM_PORT = 20
TO_PORT = 21

[permission vsftpd]
FROM_PORT = 13000
TO_PORT = 13100

[permission galaxy]
FROM_PORT = 8080
TO_PORT = 8080

[permission galaxy1]
FROM_PORT = 8088
TO_PORT = 8088

#### here, we create two cluster users
[plugin createusers]
SETUP_CLASS = starcluster.plugins.users.CreateUsers
usernames = rnaseq, guest

[plugin sge]
SETUP_CLASS = starcluster.plugins.sge.SGEPlugin
MASTER_IS_EXEC_HOST = False
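Several values in the template above are intentionally left empty (the AWS credentials, the key location, the node image id and the volume ids). Before starting a cluster, it can be useful to list what is still unfilled. The sketch below does this with Python's standard configparser, on the assumption that your edited file keeps the INI-style layout shown above; the function name is hypothetical. Note that configparser reports option names in lower case.

```python
import configparser


def unfilled_settings(config_text):
    """Return sorted 'section: key' entries whose value is still empty."""
    # Interpolation is disabled so that '%' characters in values are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            if value.strip() == "":
                missing.append(f"{section}: {key}")
    return sorted(missing)
```

Reading your edited template into a string and passing it to unfilled_settings should return an empty list once every placeholder has been filled in.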

Install and configure Galaxy

The next step is to install a Galaxy server, preferably on the master node of the cluster. Please follow these steps:

  • Galaxy is easy to start with, but at a certain point patience is needed to get over the learning curve (in my experience), so be prepared to spend more time than expected on Galaxy. To continue with the RNA-seq portal, you need to be comfortable with setting up a working Galaxy server and with adding new tools to a Galaxy portal.
  • Once you are familiar with Galaxy, check $RNA_SEQ_ROOT_DIR/galaxy, where you will find these files:
config/galaxy.ini
config/tool_conf.xml
config/tool_data_table_conf.xml
static/welcome.html
tool-data/bowtie2_indices.loc
tool-data/README
tool-data/tophat2_indices.loc
tools/myrnaseqwf/README
tools/myrnaseqwf/star-pe.xml
tools/myrnaseqwf/star-se.xml
tools/myrnaseqwf/tophat-mapping-pe.xml
tools/myrnaseqwf/tophat-mapping-se.xml
tools/myrnaseqwf/trinity-pe.xml
tools/myrnaseqwf/trinity-se.xml
  • Please merge config/galaxy.ini, config/tool_conf.xml and config/tool_data_table_conf.xml into the corresponding files in your own Galaxy folder. This step assumes you are familiar with Galaxy's configuration files.
  • Edit static/welcome.html to reflect your local settings.
  • Copy tool-data/*.loc to your Galaxy tool-data folder and edit the files according to tool-data/README.
  • Copy the tools/myrnaseqwf folder to your Galaxy tools folder and edit it according to tools/myrnaseqwf/README.
  • Check all the updated files to make sure everything you added is consistently cross-linked (paths, file names, etc.).

Configure web server

In the $RNA_SEQ_ROOT_DIR/web directory, there is a job.php file. You need to configure a web server with PHP enabled (ask your IT department) so that you can access http://your_URL/job.php. To test it, try http://your_URL/job.php?20160711094138113678029672, where 20160711094138113678029672 is a sample results directory copied into $RNA_SEQ_ROOT_DIR/web/user-data. Some large files were deleted to reduce the size of the software package, so some broken links are expected on that page.

Download reference genomes

Reference genomes can be downloaded from ftp://weizhongli-lab.org/pub/reference-genomes/. You should put the genome files under $RNA_SEQ_ROOT_DIR/refs.

Install workflow scripts

Go to $RNA_SEQ_ROOT_DIR/NGS-tools and edit the following files:

  • NGS-wf-galaxy-RNAseq-config.pl
#### go to line #7 and replace /home/oasis/data/NGS-ann-project with your $RNA_SEQ_ROOT_DIR
$NGS_root     = "/home/oasis/data/NGS-ann-project";
  • NGS-wf-galaxy-run.pl
#### go to line #31
$ENV{"PATH"} = "/home/oasis/data/NGS-ann-project/apps/bin:". $ENV{"PATH"};
#### replace /home/oasis/data/NGS-ann-project/apps/bin: with $RNA_SEQ_ROOT_DIR/apps/bin:

my $www_dir     = "/home/oasis/data/RNA-seq-dir/web/user-data";
#### replace /home/oasis/data/RNA-seq-dir/web/user-data with $RNA_SEQ_ROOT_DIR/web/user-data

my $www_web_url = "http://weizhongli-lab.org/RNA-seq/Data/job.php";
#### replace http://weizhongli-lab.org/RNA-seq/Data/job.php with http://your_URL/job.php
#### your_URL is the URL you just setup
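If you prefer not to edit the scripts by hand, the replacements listed above can be applied with a small helper. The sketch below is a hypothetical Python helper, not part of the software package; it performs plain text substitution and reports how often each old string was found, so a count of 0 tells you that a line did not match and needs manual editing.

```python
from pathlib import Path


def localize_script(script_path, replacements):
    """Apply {old: new} substitutions in place; return occurrences per old string."""
    path = Path(script_path)
    text = path.read_text()
    counts = {}
    for old, new in replacements.items():
        counts[old] = text.count(old)
        text = text.replace(old, new)
    path.write_text(text)
    return counts
```

For NGS-wf-galaxy-run.pl, the replacements would map "/home/oasis/data/NGS-ann-project/apps/bin:" to your $RNA_SEQ_ROOT_DIR followed by "/apps/bin:", "/home/oasis/data/RNA-seq-dir/web/user-data" to your $RNA_SEQ_ROOT_DIR followed by "/web/user-data", and "http://weizhongli-lab.org/RNA-seq/Data/job.php" to your own job.php URL. Keep a backup copy of each script before running the helper.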

Restart Galaxy server

Restart the Galaxy server so that it loads the updated configuration and the new workflow tools.

References

If you find the RNA-seq portal helpful to your research and study, please kindly cite this reference:

  • Li, Weizhong, R. Alexander Richter, Yunsup Jung, Qiyun Zhu, and Robert W. Li. “Web-based bioinformatics workflows for end-to-end RNA-seq data computation and analysis in agricultural animal species.” BMC Genomics 17, no. 1 (2016): 761.

Please also cite the original methods integrated into our workflows:

  • Trapnell, Cole and Roberts, Adam and Goff, Loyal and Pertea, Geo and Kim, Daehwan and Kelley, David R and Pimentel, Harold and Salzberg, Steven L and Rinn, John L and Pachter, Lior (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. In Nat Protoc, 7 (3), pp. 562–578.
  • Haas, Brian J and Papanicolaou, Alexie and Yassour, Moran and Grabherr, Manfred and Blood, Philip D and Bowden, Joshua and Couger, Matthew Brian and Eccles, David and Li, Bo and Lieber, Matthias and et al. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. In Nature Protocols, 8 (8), pp. 1494–1512.
  • Dobin, A. and Davis, C. A. and Schlesinger, F. and Drenkow, J. and Zaleski, C. and Jha, S. and Batut, P. and Chaisson, M. and Gingeras, T. R. (2012). STAR: ultrafast universal RNA-seq aligner. In Bioinformatics, 29 (1), pp. 15–21.
  • Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.