Data sets in GExplore

The following data sets are used in the GExplore database.

Protein data (domains and sequence)

Protein sequence data for the various nematode species accessible through the protein search interface were downloaded from Wormbase release 250. SMART was used to identify protein domains.

Mutation data

Mutation data accessible through the mutation search interface were downloaded from Wormbase release 250. These include only alleles whose molecular nature is known and which affect the coding part of a gene. The alleles shown on the result page of the 'Genes' search, by contrast, include all alleles listed in Wormbase.

Gene description, expression and phenotype data

All data available for search and display through the genes search interface were downloaded from Wormbase using either the latest release available on WormMine (WB250) or files available from the Wormbase FTP server.

Gene expression data (life stages)

The expression data for the different stages of the life cycle of C. elegans were obtained as part of the NHGRI modENCODE project in a collaboration of the Waterston (Max Boeck, Chau Huynh, LaDeana Hillier, University of Washington), Reinke (Guilin Wang, Dionna Kasper, Yale University) and Miller (Clay Spencer, Vanderbilt University) labs (refs. 1-6). Altogether, 18x samples were analyzed with RNA-seq. The results provided here are a subset of that data, derived from synchronized whole animals from embryonic and post-embryonic stages. For post-embryonic stages, two or more biological replicates were obtained; the weighted averages are provided. For the embryonic stages, 4 independent time series were collected, taking samples every 30 minutes (some samples were lost to technical failures). Simply combining the embryo series was precluded because each series had a different distribution of developmental stages in the starting population and because the series varied in growth conditions, particularly temperature. Instead we developed a Bayesian approach that exploited the distinct expression patterns of different genes to infer the expression at different stages (see below).

Expression units
Expression for each gene is given in dcpm units (depth of coverage per million 35-base reads) (ref. 1). This measure has the advantage of providing a base-pair resolution estimate of expression, simplifying the analysis of shorter features such as exons and splice junctions. A dcpm value of 1.5 reflects an average level of expression. Values between 0.003 and 0.1 generally represent significant expression, but statistical noise increases toward the lower end of that range, even though more than 20 million reads were collected per sample in almost all cases. As a rough approximation, dcpm can be converted to rpkm values using the following equation: rpkm = dcpm * 1000 / 35.
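The conversion above can be written as a pair of small helper functions. This is only a minimal sketch of the stated approximation; the function names and the read_length parameter are illustrative and are not part of GExplore.

```python
READ_LENGTH = 35  # bases per read, as in the dcpm definition above


def dcpm_to_rpkm(dcpm, read_length=READ_LENGTH):
    """Approximate rpkm from dcpm: rpkm = dcpm * 1000 / read_length."""
    return dcpm * 1000.0 / read_length


def rpkm_to_dcpm(rpkm, read_length=READ_LENGTH):
    """Inverse of the approximation: dcpm = rpkm * read_length / 1000."""
    return rpkm * read_length / 1000.0


if __name__ == "__main__":
    # Landmark dcpm values mentioned in the text.
    for dcpm in (0.003, 0.1, 1.5):
        print(f"{dcpm} dcpm is roughly {dcpm_to_rpkm(dcpm):.2f} rpkm")
```

By this approximation, the average expression level of 1.5 dcpm corresponds to roughly 43 rpkm.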

Synchronization
Synchronized post-embryonic populations were obtained using the standard bleaching protocol followed by hatching in the absence of food. Mid-stage animals were then collected for each of the larval stages and young adult, using landmarks as described (ref. 1). To obtain dauer entry, dauer and dauer exit populations, we used daf-2(ts) mutants, using temperature shifts to control staging. To obtain males, we used him-8 mutants, hand-picking L4 males. To obtain L4 soma, we used glp-1(ts) mutants grown at 25 °C. Some of the differences in expression in the dauer and soma samples could be due to the higher growth temperatures.
Synchronized embryonic populations were obtained by harvesting synchronized adults (using the methods described above) just as the first animals were beginning to contain fertilized eggs. After bleaching, worms were washed, and incompletely dissolved adult carcasses were removed by filtration or by sucrose flotation (sucrose flotation had the added advantage of removing dead or damaged embryos). After filtration, the egg population consisted largely of 4-cell and 8-cell embryos, with lesser numbers of 2-cell, 15-cell and occasional 27-cell or later embryos. Inevitably there was some variation in the distribution of embryo stages in the different series. Samples were collected immediately upon completion of the final clean-up step and every 30 minutes thereafter.

RNA-seq library preparation and sequencing
Total RNA was prepared using standard methods. For the post-embryonic data presented here and for one embryonic series, poly-A+ RNA was selected using oligo-dT columns or beads. For the other three embryonic series (0223, 0411, 0419), rRNA was depleted using Ribo-Zero (Epicentre). In addition, rRNA-depleted (Ribo-Zero) samples were processed for several post-embryonic stages (L2, L4, YA); these data are available on the modENCODE website but are not presented here. Generally, for the protein-coding genes, the Spearman correlation (r) between the poly-A+ and rRNA-depleted samples is greater than 0.85. The notable exceptions are the poly-A- mRNAs of the constitutive histone genes. In the embryo series, these genes can reach quite high levels of expression, constituting more than 30% of reads from mid-embryo stages.
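For readers who want to make a similar comparison between two samples, the Spearman correlation mentioned above can be computed as the Pearson correlation of the ranks. The sketch below is a generic textbook implementation, not the code used for the analysis described here; the function names and the example dcpm values are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical dcpm values for five genes in two library types.
    poly_a = [0.5, 1.2, 3.0, 0.05, 7.5]
    rrna_depleted = [0.6, 1.0, 2.5, 0.04, 9.0]
    print(spearman(poly_a, rrna_depleted))
```

Because the measure depends only on ranks, it is insensitive to overall scale differences between the two library preparations.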
cDNA was prepared using random hexamers as primers. After double-stranding, the products were sheared, linkers attached, PCR-amplified and sequenced using various Illumina instruments. Read length was increased as technology improved. Generally, reads were obtained from only one end.

Analysis
The resulting Illumina reads were aligned with cross-match to the genome, to the splice junctions from the known transcriptome, and to a library of possible splice sites. Reads containing splice leader sequences and polyA tails were detected and the untemplated bases trimmed before aligning. In each case the best match was used and the read was then assigned to the corresponding genomic location. After removing rDNA matches, dcpm was calculated as previously described (ref. 1).

Combining the embryonic series
Combining the data from the different embryonic time series presented several challenges. Each sample in each series was assumed to consist of a distribution of embryos of different stages. The initial starting population was a mixture of embryos at different stages of development; there was variation in the series start time; there was variation in the growth rate; and there were missing samples due to experimental failures. To deal with this we developed a model in which we assumed that the matrix of measured gene expression for all the samples was the product of the matrix of expression for each gene at each stage (the stage-expression array) and an array of the fraction of embryos in each individual stage for each sample (the proportion array). We then used a Bayesian framework to estimate the unknown parameters, assuming any errors were normally distributed with a gene-specific variance and minimizing the overall error. For the expression array, we used 17 uniformly spaced time points to reflect the underlying data collection scheme. To constrain the proportion array we introduced four parameters per series: the mean developmental time of the first sample, the relative growth rate of the series, the initial variance of the series, and the final variance. We assumed that the variance per sample could only increase with time. To train the model we selected the 8000 most highly expressed genes (max dcpm greater than 2) to provide a high signal-to-noise ratio, and we used an MCMC algorithm to estimate the joint posterior probabilities of the unknown parameters.
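The forward part of this model (measured expression as the product of the stage-expression array and the proportion array) can be illustrated with a small sketch. It covers only the prediction step, not the Bayesian estimation or the MCMC sampler. Several details are assumptions made for illustration and are not taken from the description above: the proportions are modelled as a discretised normal distribution over the 17 time points, the variance is interpolated linearly between its initial and final values, the time points are spaced 30 minutes apart, and all function and parameter names are hypothetical.

```python
import math

N_STAGES = 17          # uniformly spaced time points, as in the text
STAGE_STEP_MIN = 30.0  # assumed spacing in minutes (matches the sampling interval)


def proportion_column(mean_time, variance):
    """Fraction of embryos at each stage for one sample (assumed normal shape)."""
    weights = [
        math.exp(-((s * STAGE_STEP_MIN - mean_time) ** 2) / (2.0 * variance))
        for s in range(N_STAGES)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def proportion_array(start_mean, growth_rate, var_initial, var_final, sample_times):
    """Build one proportion column per sample from the four series parameters."""
    if var_final < var_initial:
        raise ValueError("variance may only increase with time")
    t0 = min(sample_times)
    span = (max(sample_times) - t0) or 1.0
    columns = []
    for t in sample_times:
        mean_time = start_mean + growth_rate * (t - t0)
        # Assumption: variance grows linearly from var_initial to var_final.
        variance = var_initial + (t - t0) / span * (var_final - var_initial)
        columns.append(proportion_column(mean_time, variance))
    return columns  # indexed as columns[sample][stage]


def predict_expression(stage_expression, columns):
    """measured[gene][sample] = sum over stages of expression * proportion."""
    return [
        [sum(e * p for e, p in zip(gene_row, col)) for col in columns]
        for gene_row in stage_expression
    ]


if __name__ == "__main__":
    # Two toy genes: one rising steadily through development, one constant.
    stage_expression = [[float(s) for s in range(N_STAGES)], [1.0] * N_STAGES]
    columns = proportion_array(60.0, 1.0, 900.0, 1600.0, [0, 30, 60, 90])
    for row in predict_expression(stage_expression, columns):
        print([round(v, 3) for v in row])
```

In this toy example the constant gene is predicted at the same level in every sample, while the rising gene increases from sample to sample as the embryo population ages. In the estimation step, the parameters of such a forward model would be adjusted so that the predictions match the measured values.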

References
1. Hillier LW, Reinke V, Green P, Hirst M, Marra MA, Waterston RH. Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 2009 Apr;19(4):657-66. PMID: 19181841. PMCID: PMC2665784. DOI: 10.1101/gr.088112.108.
2. Gerstein MB, Lu ZJ, Van Nostrand EL, et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science. 2010 Dec 24;330(6012):1775-87. PMID: 21177976. PMCID: PMC3142569. DOI: 10.1126/science.1196914.
3. Gerstein MB, Rozowsky J, Yan K, Wang D, et al. Comparative analysis of the transcriptome across distant species. Nature. 2014 Aug 28;512(7515):445-8. PMID: 25164755. PMCID: PMC4155737. DOI: 10.1038/nature13424.
4. Boeck M, et al. The time-resolved transcriptome of C. elegans. Genome Res. 2016 Oct;26(10):1441-1450. PMID: 27531719. PMCID: PMC5052054. DOI: 10.1101/gr.202663.115.
5. Hillier L, et al. In preparation.
6. Gevirtzman L, et al. In preparation.

Gene expression data (tissues and cell types)

Single-cell expression data were generated from L2 worms using single-cell combinatorial indexing RNA-seq (sci-RNAseq). See Cao et al. (2017) for details.