Background

The Million Mutation Project (MMP), a joint project of the Moerman and Waterston labs, has exploited inexpensive whole genome resequencing to identify multiple mutations in virtually every C. elegans gene in a collection of 2,000 mutagenized strains. Our early study explored massively parallel short read sequencing to detect mutations in the C. elegans genome (Flibotte et al., 2010; Genetics 185: 431-441) and showed that individual strains carried as many as a few hundred mutations, including missense, nonsense and splice site mutations. Thus in a collection of 2,000 strains we could recover multiple mutations in virtually every gene in the genome. Sequence characterization of the collection provides the community with a resource where obtaining mutations in a gene of interest is as simple as ordering a few strains from the stock center. The large number of mutations in each strain allows comprehensive coverage of the genome in a relatively small number of strains, simplifying phenotypic screening and other manipulations. In turn, mutations of interest can be placed in a normal background through simple outcrossing, which takes only a week or two in C. elegans. The collection as a whole might also be screened using secondary treatments, such as pharmaceutical reagents or RNAi, to look for interacting genes.

For the MMP we have built a library of 2,007 mutagenized strains and sequenced each to a target depth of 15x genome coverage. After testing various mutagens we settled on using EMS, ENU or a cocktail of EMS plus ENU. The cocktail was used for over half the library. By using both mutagens we obtained a wider range of nucleotide transitions and transversions than with EMS alone but recovered more mutations per strain than with ENU alone. As a guide to the mutagen used for specific strains, VC10xxx strains were isolated after UV/TMP mutagenesis, VC20xxx strains were isolated after EMS mutagenesis, VC30xxx strains were isolated after ENU mutagenesis and VC40xxx strains were isolated after EMS+ENU mutagenesis. Our starting wild type strain was VC2010, which was sequenced and described in Flibotte et al. (2010).

To ensure that the animals picked were mutagenized, we selected for unc-22 mutations in the F1 generation. In the F2 we selected for non-Unc-22 animals, and these animals were then self-crossed for 8 additional generations to drive the strain to homozygosity across all regions of the genome. To supplement this collection of mutagenized strains, we also sequenced 40 wild isolates, including the Hawaiian strain CB4856 as well as representative strains from across the world.

For sequencing we loaded size-selected, barcoded samples on either an Illumina GAII or HiSeq sequencing machine and generated paired-end reads. For data analysis we aligned the reads with phaster (P. Green, unpublished). We used SAMtools to identify possible single nucleotide variants (SNVs) and then applied post-filtering to retain homozygous, high quality SNVs. In particular, we eliminated SNVs at about 100,000 sites that yielded variable numbers of disagreeing bases across all the strains, thus removing many false positives. A list of these sites is available for download here; it contains parental SNVs, which are also available separately here.
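
The site-based filtering step described above can be sketched as follows. This is only an illustration, not the project's pipeline: it assumes the downloadable list of variable sites has already been loaded as a set of (chromosome, position) pairs, and the function name drop_variable_sites and the tuple layout of a call are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical representations chosen for this sketch:
#   a site is (chromosome, 1-based position)
#   a call is (chromosome, position, reference base, variant base)
Site = Tuple[str, int]
Call = Tuple[str, int, str, str]


def drop_variable_sites(calls: Iterable[Call], variable_sites: Set[Site]) -> List[Call]:
    """Return only the calls whose site is not in the list of variable sites."""
    return [call for call in calls if (call[0], call[1]) not in variable_sites]


# Toy data, invented purely for illustration.
variable_sites = {("I", 1000), ("X", 52345)}
calls = [
    ("I", 1000, "G", "A"),    # falls on a listed site, so it is dropped
    ("II", 20500, "C", "T"),  # kept
    ("X", 52346, "G", "A"),   # adjacent to a listed site but not on it, so kept
]
kept = drop_variable_sites(calls, variable_sites)
```

Using a set for the listed sites keeps each lookup at constant time, which matters when a list of about 100,000 sites is checked against every call in every strain.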
Insertion and deletion differences (indels) were detected using a variety of custom-built tools. Again we eliminated sites that behaved variably across the strains. Copy number variants (CNVs) were detected by variations in read coverage. The SNVs and indels are currently interpreted using WS230 annotations and have been lifted over to WS235 for this web site. The sequence data are deposited in the SRA under accession SRP018046, the variant calls have been submitted to WormBase, and the strains have been deposited at the Caenorhabditis Genetics Center in Minneapolis, Minnesota. Eventually we hope to distribute the 2,007 strains as a single kit, allowing parallel experimentation on a wide spectrum of mutant genes. We hope this complex library of mutations will act as a community resource.
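
The strain-naming guide given earlier in this section can be written as a small lookup, which may be handy when sorting strains from the downloaded data set by mutagen. This is a minimal sketch: it assumes a strain name is "VC" followed by exactly five digits, as the VC10xxx to VC40xxx patterns suggest, and it returns None for any name the guide does not cover. The function name mutagen_for_strain is hypothetical.

```python
import re
from typing import Optional

# Prefix-to-mutagen table, taken directly from the naming guide in the text.
MUTAGEN_BY_PREFIX = {
    "VC10": "UV/TMP",
    "VC20": "EMS",
    "VC30": "ENU",
    "VC40": "EMS+ENU",
}

# Assumption for this sketch: "VC" followed by exactly five digits.
_STRAIN_NAME = re.compile(r"^VC\d{5}$")


def mutagen_for_strain(strain: str) -> Optional[str]:
    """Return the mutagen implied by the strain prefix, or None if not covered."""
    name = strain.strip().upper()
    if not _STRAIN_NAME.match(name):
        return None
    return MUTAGEN_BY_PREFIX.get(name[:4])


# "VC20019" is an illustrative name that fits the VC20xxx pattern.
example = mutagen_for_strain("VC20019")
```

Returning None rather than guessing keeps names outside the four listed patterns from being silently assigned a mutagen.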

The site reports results from the analysis of 2,007 mutagenized strains and 40 wild isolates:

Results from mutagenized strains

The data set containing all mutations in the mutagenized strains is available for download here. (file size: 76 MB)

Results from wild isolates

The data set containing all mutations in the wild isolates is available for download here. (file size: 438 MB)

Mutagens used

Additional files for download (see Background section for more explanation)

A paper describing the results and our methods is in press:

The Million Mutation Project:
A New Approach to Genetics in Caenorhabditis elegans.

Owen Thompson1, Mark Edgley2, Pnina Strasbourger1, Stephane Flibotte2, Brent Ewing1, Ryan Adair2, Vinci Au2, Iasha Chaudhry2, Lisa Fernando2, Harald Hutter3, Armelle Kieffer2, Joanne Lau2, Norris Lee2, Angela Miller2, Greta Raymant2, Bin Shen2, Jay Shendure1, Jon Taylor2, Emily H. Turner1, LaDeana W. Hillier1, Donald G. Moerman2*, Robert H. Waterston1*

1 Department of Genome Sciences, University of Washington, Seattle, WA, U.S.A.
2 Department of Zoology and Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z3 Canada
3 Department of Biological Sciences, Simon Fraser University, Burnaby, B.C., Canada
* Co-corresponding authors

A few caveats about using the strains

While we have endeavored to be as accurate as possible in calling SNVs and indels, there will be some false positives (estimated at about 1%) and false negatives (estimated to be 7% for SNVs and slightly higher for indels) in a project of this scale. Despite the multiple generations of self-crossing, regions of some strains remain heterozygous by sequence criteria. Some of these regions may simply represent chance events, whereas others appear to represent balanced mutations. A display showing sites with likely heterozygous SNVs along with regions of likely CNVs is available using the "plot" link next to each strain in the search results. Also, because the sequence coordinates for this web site have simply been lifted over from WS230 to WS235 to be compatible with WormBase, some rare discrepancies may appear in the annotation. When investigating particular mutations, be sure to verify the change before proceeding with further studies. Should you find discrepancies, please contact Mark Edgley (edgley@mail.ubc.ca).
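
To give a feel for what these error estimates mean for a single strain, the sketch below turns them into rough expected counts. It rests on two assumptions that the text does not spell out: that the 1% figure is the fraction of reported SNVs that are wrong, and that the 7% figure is the fraction of true SNVs that go unreported. The function name and the example count of 400 called SNVs are purely illustrative.

```python
from typing import Tuple


def expected_error_counts(
    called_snvs: float,
    false_positive_fraction: float = 0.01,  # assumed: share of reported SNVs that are wrong
    false_negative_fraction: float = 0.07,  # assumed: share of true SNVs that are missed
) -> Tuple[float, float]:
    """Return (expected false calls, expected missed SNVs) for one strain."""
    false_calls = called_snvs * false_positive_fraction
    true_calls = called_snvs - false_calls
    # If a fraction f of true SNVs is missed, the reported true calls
    # make up (1 - f) of all true SNVs in the strain.
    true_total = true_calls / (1.0 - false_negative_fraction)
    missed = true_total - true_calls
    return false_calls, missed


# Illustrative strain with 400 reported SNVs.
false_calls, missed = expected_error_counts(400)
print(f"expected false calls: {false_calls:.1f}, expected missed SNVs: {missed:.1f}")
```

Under these assumptions a strain with a few hundred reported SNVs would be expected to carry a handful of spurious calls, which is why verifying a mutation of interest before further work is recommended.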

Database History

July 10, 2014

added data for restriction enzymes recognizing the polymorphisms

May 27, 2014

removed data for strain VC60667

May 29, 2013

corrected an error leading to incomplete datasets being displayed when searching for SNVs only (affecting mainly intronic and intergenic mutations)

March 26, 2013

the following changes were made to the database:

  • final release of all data from 2,007 mutagenized strains and 40 wild isolates

July 19, 2012

the following changes were made to the database:

  • more than 14,000 short indels of a few nucleotides (plus and minus) were added and annotated. Data for larger indels will be added at a later date.
  • PIWI interacting RNAs are now listed separately as piRNA (were previously annotated as ncRNA).
  • mutations that alter the start methionine to another amino acid are now annotated with type: "start ATG" (were previously annotated as "missense")

June 11, 2012

initial database release


This study was supported by NHGRI funding to Robert Waterston, LaDeana Hillier, Jay Shendure and Donald Moerman. This study was also funded by CIHR funding to Donald Moerman and by NSERC and CIHR funding to Harald Hutter.