Background
The Million Mutation Project (MMP), a joint project by the Moerman and Waterston labs, has exploited inexpensive whole genome resequencing to identify multiple mutations in virtually every C. elegans gene in a collection of 2,000 mutagenized strains. Our early study explored massively parallel short read sequencing to detect mutations in the C. elegans genome (Flibotte et al., 2010; Genetics 185: 431-441) and showed that individual strains carried as many as a few hundred mutations, including missense, nonsense and splice site mutations. Thus in a collection of 2,000 strains we could recover multiple mutations in virtually every gene in the genome. Sequence characterization of the collection provides the community with a resource where obtaining mutations in a gene of interest is as simple as ordering a few strains from the stock center. The large number of mutations in each strain allows comprehensive coverage of the genome in a relatively small number of strains, simplifying phenotypic screening and other manipulations. In turn, mutations of interest can be placed in a normal background through simple outcrossing, which takes only a week or two in C. elegans. The collection as a whole might also be screened using secondary treatments, such as pharmaceutical reagents or RNAi, to look for interacting genes.
For the MMP we have built a library of 2,007 mutagenized strains and sequenced each to a target depth of 15x genome coverage. After testing various mutagens we settled on using EMS, ENU or a cocktail of EMS plus ENU. The cocktail was used for over half the library. By using both mutagens we obtained a wider range of nucleotide transitions and transversions than with EMS alone but recovered more mutations per strain than with ENU alone. As a guide to the mutagen used for specific strains, VC10xxx strains were isolated after UV/TMP mutagenesis, VC20xxx strains were isolated after EMS mutagenesis, VC30xxx strains were isolated after ENU mutagenesis and VC40xxx strains were isolated after EMS+ENU mutagenesis. Our starting wild type strain was VC2010, which was sequenced and described in Flibotte et al. (2010). To ensure that the animals picked had been mutagenized, we selected for unc-22 mutations in the F1 generation. In the F2 we selected non-Unc-22 animals, and these animals were then self-crossed for 8 additional generations to drive the strain to homozygosity across all regions of the genome. To supplement this collection of mutagenized strains, we also sequenced 40 wild isolates, including the Hawaiian strain CB4856 as well as representative strains from across the world.
For sequencing we loaded size-selected, barcoded samples on either an Illumina GAII or HiSeq sequencing machine and generated paired-end reads. For data analysis we used phaster (P. Green, unpublished) to align the sequence reads. We used SAMtools to identify possible single nucleotide variants (SNVs) and then applied post-filtering to retain homozygous, high quality SNVs. In particular, we eliminated SNVs at about 100,000 sites that yielded variable numbers of disagreeing bases across all the strains, thus removing many false positives. A list of these sites is available for download here; it includes the parental SNVs, which are also available separately here.
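As an illustration only, the post-filtering idea described above (drop calls at sites that behave variably across strains, then keep calls that look homozygous and high quality) could be sketched as follows. This is not the MMP pipeline: the record layout, the thresholds (`min_alt_fraction`, `min_quality`), the strain name and the coordinates are all assumptions made for the example.

```python
from typing import NamedTuple


class SNV(NamedTuple):
    strain: str          # hypothetical strain name
    chrom: str
    pos: int
    alt_fraction: float  # fraction of reads supporting the variant base
    quality: float       # caller-reported quality score


def filter_snvs(calls, variable_sites, min_alt_fraction=0.9, min_quality=30.0):
    """Keep calls that are not at a variable site and look homozygous and high quality.

    The two thresholds are illustrative assumptions, not the project's values.
    """
    return [
        c for c in calls
        if (c.chrom, c.pos) not in variable_sites
        and c.alt_fraction >= min_alt_fraction
        and c.quality >= min_quality
    ]


# Made-up example data: one excluded site and four candidate calls.
variable_sites = {("I", 1000)}
calls = [
    SNV("VC20123", "I", 1000, 1.00, 60.0),   # at a variable site: dropped
    SNV("VC20123", "II", 2500, 0.95, 55.0),  # kept
    SNV("VC20123", "X", 4200, 0.50, 50.0),   # likely heterozygous: dropped
    SNV("VC20123", "V", 900, 1.00, 10.0),    # low quality: dropped
]
kept = filter_snvs(calls, variable_sites)
print(kept)
```

Storing the excluded sites as a set of `(chromosome, position)` pairs keeps each lookup cheap, which matters when roughly 100,000 sites are checked against every call in every strain.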
Insertion and deletion differences were detected using a variety of custom-built tools. Again, we eliminated sites that behaved variably across the strains. Copy number variants (CNVs) were detected by variations in read coverage. The SNVs and indels are currently interpreted using WS230 annotations and have been lifted over to WS235 for this web site. The sequence data are deposited in the SRA under accession SRP018046, the variant calls have been submitted to WormBase, and the strains have been deposited at the Caenorhabditis Genetics Center in Minneapolis, Minnesota. Eventually we hope to distribute the 2,007 strains as a single kit, allowing parallel experimentation on a wide spectrum of mutant genes. We hope this complex library of mutations will act as a community resource.
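To give a rough sense of how variations in read coverage can point to CNVs, here is a minimal sketch that compares per-window read depth with the strain's median depth. It is a generic approach shown for illustration, not the method used for the MMP; the window depths and the `low` and `high` ratio cut-offs are assumptions.

```python
from statistics import median


def flag_cnv_windows(coverage, low=0.5, high=1.5):
    """Return (window index, depth ratio, label) for windows far from the median depth.

    `low` and `high` are illustrative ratio cut-offs, not values from the project.
    """
    base = median(coverage)
    if base == 0:
        return []
    flagged = []
    for i, depth in enumerate(coverage):
        ratio = depth / base
        if ratio <= low:
            flagged.append((i, ratio, "possible deletion"))
        elif ratio >= high:
            flagged.append((i, ratio, "possible duplication"))
    return flagged


# Made-up depths for ten consecutive windows in a strain sequenced to about 15x.
window_depths = [15, 14, 16, 15, 2, 1, 15, 31, 30, 14]
flagged = flag_cnv_windows(window_depths)
for index, ratio, label in flagged:
    print(index, round(ratio, 2), label)
```

Using the median as the baseline keeps a handful of deleted or duplicated windows from shifting the reference depth, which a mean would not do as well.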
The site reports results from the analysis of 2,007 mutagenized strains and 40 wild isolates:
Results from mutagenized strains
- 840,429 SNVs, representing 826,810 different mutational events in 20,115 genes.
- 17,333 indels with defined end points (12,321 deletions, 5,012 insertions), reflecting 14,881 unique changes (10,420 deletions, 4,461 insertions).
- 1,483 large CNVs, of which 1,222 are distinct, in 887 strains.
- 183,327 non-synonymous changes in 19,666 different genes.
- 12,594 ‘knockouts’ (nonsense mutations or splicing defects) in 8,150 genes.
- On average, about 400 mutations per strain.
- On average, just under 9 new non-synonymous alleles per gene.
- On average, about 4 nonsense alleles per strain.
The data set containing all mutations in the mutagenized strains is available for download here. (file size: 76 MB)
Results from wild isolates
- 3,789,728 SNVs consisting of 630,541 unique events.
- 1,213,067 indels consisting of 220,823 unique events.
- 66,576 non-synonymous changes in 14,062 different genes.
- 1,560 ‘knockouts’ (nonsense mutations or splicing defects) in 1,323 genes.
- On average, around 95,000 SNVs and 30,000 indels per strain.
The data set containing all mutations in the wild isolates is available for download here. (file size: 438 MB)
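The per-strain averages quoted in the two results lists can be checked against the reported totals with simple division. The short sketch below uses only counts given on this page; the variable names are ours.

```python
# Counts taken from the results lists on this page.
mutagenized_strains = 2007
wild_isolates = 40

snvs_mutagenized = 840_429
snvs_wild = 3_789_728
indels_wild = 1_213_067

# Mean number of calls per strain in each collection.
snvs_per_mutagenized_strain = snvs_mutagenized / mutagenized_strains
snvs_per_wild_isolate = snvs_wild / wild_isolates
indels_per_wild_isolate = indels_wild / wild_isolates

print(round(snvs_per_mutagenized_strain))  # compare with "about 400 mutations per strain"
print(round(snvs_per_wild_isolate))        # compare with "around 95,000 SNVs"
print(round(indels_per_wild_isolate))      # compare with "30,000 indels per strain"
```

The rounded values come out at roughly 419 SNVs per mutagenized strain and roughly 94,700 SNVs and 30,300 indels per wild isolate, in line with the approximate figures given in the lists.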
Mutagens used
- VC10xxx strains were isolated after UV/TMP mutagenesis
- VC20xxx strains were isolated after EMS mutagenesis
- VC30xxx strains were isolated after ENU mutagenesis
- VC40xxx strains were isolated after EMS+ENU mutagenesis
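When working with a list of strain names, the prefix convention above can be applied programmatically. The helper below is a small sketch; the function name and the example strain name VC20123 are hypothetical, and names with prefixes not listed above return `None`.

```python
# Prefix-to-mutagen mapping taken from the list above.
MUTAGEN_BY_PREFIX = {
    "VC10": "UV/TMP",
    "VC20": "EMS",
    "VC30": "ENU",
    "VC40": "EMS+ENU",
}


def mutagen_for_strain(strain_name):
    """Return the mutagen implied by a strain name, or None if the prefix is not listed."""
    name = strain_name.strip().upper()
    # Expect a four-character prefix followed by three digits, as in VC20xxx.
    if len(name) != 7 or not name[4:].isdigit():
        return None
    return MUTAGEN_BY_PREFIX.get(name[:4])


print(mutagen_for_strain("VC20123"))  # hypothetical strain name with the EMS prefix
print(mutagen_for_strain("N2"))       # not a VCn0xxx name
```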
Additional files for download (see Background section for more explanation)
- a list of sites that yielded variable numbers of disagreeing bases across all the strains
- a list of parental SNVs
A paper describing the results and our methods is in press:
The Million Mutation Project:
A New Approach to Genetics in Caenorhabditis elegans.
Owen Thompson¹, Mark Edgley², Pnina Strasbourger¹, Stephane Flibotte², Brent Ewing¹, Ryan Adair², Vinci Au², Iasha Chaudhry², Lisa Fernando², Harald Hutter³, Armelle Kieffer², Joanne Lau², Norris Lee², Angela Miller², Greta Raymant², Bin Shen², Jay Shendure¹, Jon Taylor², Emily H. Turner¹, LaDeana W. Hillier¹, Donald G. Moerman²*, Robert H. Waterston¹*
¹ Department of Genome Sciences, University of Washington, Seattle, WA, U.S.A.
² Department of Zoology and Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z3 Canada
³ Department of Biological Sciences, Simon Fraser University, Burnaby, B.C., Canada
* Co-corresponding authors
A few caveats about using the strains
While we have endeavored to be as accurate as possible in calling SNVs and indels, a project of this scale will contain some false positives (estimated at about 1%) and false negatives (estimated at 7% for SNVs and slightly higher for indels). Despite the multiple generations of self-crossing, regions of some strains remain heterozygous by sequence criteria. Some of these may simply represent chance events, whereas others appear to represent balanced mutations. A display showing sites with likely heterozygous SNVs along with regions of likely CNVs is available using the "plot" link next to each strain in the search results. Also, because the sequence coordinates for this web site have simply been lifted over from WS230 to WS235 to be compatible with WormBase, some rare discrepancies may appear in the annotation. When investigating particular mutations, be sure to verify the change before proceeding with further studies. Should you find discrepancies, please contact Mark Edgley (edgley@mail.ubc.ca).
Database History
- July 10, 2014: added data for restriction enzymes recognizing the polymorphisms
- May 27, 2014: removed data for strain VC60667
- May 29, 2013: corrected an error leading to incomplete datasets being displayed when searching for SNVs only (affecting mainly intronic and intergenic mutations)
- March 26, 2013: the following changes were made to the database:
- July 19, 2012: the following changes were made to the database:
- June 11, 2012: initial database release
This study was supported by NHGRI funding to Robert Waterston, LaDeana Hillier, Jay Shendure and Donald Moerman. It was also supported by CIHR funding to Donald Moerman and by NSERC and CIHR funding to Harald Hutter.