Background
CRISPR/Cas9 is the current technology of choice for genome editing because of its versatility. It is an RNA-guided system in which a 20-base crRNA and a tracrRNA (together called a guide RNA, or sgRNA) direct a Cas9 nuclease (from Streptococcus pyogenes) to the target sequence, providing high specificity with minimal secondary site effects. At the target site the endonuclease makes a double-strand cut in the DNA, which can then be repaired through Non-Homologous End Joining (NHEJ) or Homology-Directed Repair (HDR). CRISPR-Cas9 technology was first adapted for C. elegans in 2013, and since then the community has produced increasingly sophisticated methods to mutate, delete and tag genes. These approaches have included Cas9 variants with different protospacer-adjacent motifs (PAM sites).
On this site we use current best practices (as of March 2016) to identify potential guide RNAs for the entire C. elegans genome.
Guide design
The following steps were taken to design the guides present in the database. The calculations were performed with an in-house Perl script/wrapper using the genome sequence and gene annotation obtained from WormBase version WS250.
- In a first step, for each PAM site in the C. elegans genome we kept the corresponding adjacent 20-base guide only if its GC content was between 20% and 80% and no poly-T tracts of length 5 or longer were present. We annotated the presence or absence of the sequence GG at the 3’ end of each guide since guides ending with GG are expected to have higher efficiency for NGG PAM (Farboud and Meyer 2015 Genetics).
- Guides for which the seed region (defined as 12 bases at the 3’ end plus PAM) was not unique in the genome were eliminated. The uniqueness of those 15-mers was assessed with in-house C code modified from Flibotte and Moerman 2008 BMC Genomics.
- The guide + PAM sequences were then aligned to the whole genome with bwa aln (Li and Durbin 2009 Bioinformatics) allowing an edit distance of 3. Guides mapping to multiple locations in the genome were eliminated.
- The minimum free energy (referred to as folding energy on this web site) in kcal/mol was calculated for each guide with the program hybrid-ss-min (Markham, N.R. 2003 “Hybrid: A software system for nucleic acid folding, hybridizing and melting predictions.” Masters thesis, Rensselaer Polytechnic Institute, Troy, NY). Values above 0 were set to 0.
- The location of the cut site associated with each guide was then annotated according to the gene feature being hit if any, keeping only one annotation per guide. Therefore, for example, a small RNA overlapping a protein coding gene will not be annotated in the database. In such cases, the user will have to search by location or by the name of the overlapping protein coding gene in order to find the guides with a cut site within the small RNA.
- The user can search the guides present in the database by entering a genomic interval, or by gene name to find guides with cut sites within genes. Constraints can be applied to the GC content, folding energy, and the presence of GG at the 3’ end of the guides being returned.
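The first two filtering steps above can be illustrated with a minimal Python sketch. This is not the in-house Perl or C code used to build the database. All function names are hypothetical, the 20% and 80% GC bounds are assumed to be inclusive, and seed uniqueness is assumed to mean that the 15-mer (12-base seed plus PAM) occurs exactly once when both strands are counted. The bwa alignment and folding energy steps are not reproduced here.

```python
import re
from collections import Counter

GUIDE_LEN = 20  # guide bases immediately 5' of the PAM
SEED_LEN = 12   # 3' bases of the guide that, with the PAM, form the 15-mer seed

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement; characters other than ACGT are left as-is."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_guides(genome_seq):
    """Yield (guide, pam) for every NGG PAM found on either strand."""
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % GUIDE_LEN)
    for strand_seq in (genome_seq, reverse_complement(genome_seq)):
        for match in pattern.finditer(strand_seq):
            yield match.group(1), match.group(2)


def gc_fraction(guide):
    """Fraction of G and C bases in the guide."""
    return (guide.count("G") + guide.count("C")) / len(guide)


def passes_basic_filters(guide):
    """GC content within 20-80% (assumed inclusive) and no run of 5 or more T's."""
    return 0.20 <= gc_fraction(guide) <= 0.80 and "TTTTT" not in guide


def kmer_counts(genome_seq, k):
    """Count every k-mer on both strands of the sequence."""
    counts = Counter()
    for strand_seq in (genome_seq, reverse_complement(genome_seq)):
        for i in range(len(strand_seq) - k + 1):
            counts[strand_seq[i:i + k]] += 1
    return counts


def select_guides(genome_seq):
    """Apply the GC, poly-T and seed-uniqueness filters; annotate 3' GG."""
    counts = kmer_counts(genome_seq, SEED_LEN + 3)
    selected = []
    for guide, pam in candidate_guides(genome_seq):
        if not passes_basic_filters(guide):
            continue
        if counts[guide[-SEED_LEN:] + pam] != 1:
            continue  # seed + PAM is not unique, so the guide is dropped
        selected.append(
            {"guide": guide, "pam": pam, "ends_with_gg": guide.endswith("GG")}
        )
    return selected
```

Counting all 15-mers in memory is only practical for short test sequences; a whole-genome run would need a more compact approach.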
For questions concerning CRISPR/Cas9 and the role of this methodology in the goals of the C. elegans knockout facility contact Don Moerman. For more specific questions concerning parameters and filters used to choose the guide RNAs contact Stephane Flibotte.
How to view the guides in the integrative genomics viewer (IGV)
All guides are listed in a text format with their coordinates on the chromosome along with a listing of what gene feature they affect. For a more visual presentation of this information, we suggest the following. All the guides passing the default filters are available for bulk download in bed format. Follow these instructions to view the data in IGV, the Integrative Genomics Viewer from the Broad Institute:
- Download the compressed bed file (NGG_guides_WS250.bed.gz) and its index file (NGG_guides_WS250.bed.idx). Gunzip the bed.gz file and put the resulting bed file and the index file in the same folder/directory.
- If you do not have IGV already installed on your computer, download it from the Broad Institute. In order to use IGV you will need a working version of Java installed on your computer. For example, the newer versions of Mac OS X do not come with Java pre-installed, but it can be downloaded from Oracle following the instructions found here.
- Start IGV. In the top left corner of the graphic window you will find a widget to select the genome to be used. The last time the C. elegans genome was changed was at version WS235, so select WS235 or any later version. You might have to click on the “more…” option at the bottom of the list. The next time you use IGV it will remember your previous selection.
- Under “File”, select “Load from File…” and load the bed file. IGV will automatically find the corresponding index file you saved in the same directory in step 1.
- Useful hints for a better viewing experience in IGV:
- Right click (control click) on the track name and “Change Track Height…” to something larger.
- Right click (control click) on the track name and change the view from the default “Collapsed” to “Expanded”, both for the guide track and for the gene track if you want to see individual isoforms.
- You can navigate in the genome and zoom in using the mouse and/or the box where you enter coordinates. You can also use that box to search by gene name.
- You can mouse over a guide to view its sequence and parameters. Note that no scoring scheme has been implemented yet, so the score always shows zero. You can right click (control click) on a guide and copy its sequence or all its information to the clipboard to paste into another application.
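For users who prefer to script the decompression step or to inspect the bulk file outside IGV, here is a minimal Python sketch. It assumes only the standard BED convention that the first three tab-separated columns are chromosome, 0-based start and end; the content of the remaining columns in NGG_guides_WS250.bed is not described here, so they are returned untouched. The function names are hypothetical.

```python
import gzip
import shutil


def gunzip(src_path, dest_path):
    """Decompress a .gz file, equivalent to running gunzip on it."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


def guides_in_interval(bed_path, chrom, start, end):
    """Return the BED rows overlapping the half-open interval [start, end)."""
    hits = []
    with open(bed_path) as handle:
        for line in handle:
            # Skip blank lines and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            row_start, row_end = int(fields[1]), int(fields[2])
            if fields[0] == chrom and row_start < end and row_end > start:
                hits.append(fields)
    return hits


# Example usage (file names taken from the download step above;
# the chromosome name and coordinates are placeholders):
# gunzip("NGG_guides_WS250.bed.gz", "NGG_guides_WS250.bed")
# rows = guides_in_interval("NGG_guides_WS250.bed", "I", 100000, 101000)
```

The chromosome naming used in the bed file should be checked before querying, since it must match the string passed as chrom exactly.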
Additional C. elegans databases hosted here
- GExplore - a tool for large-scale mining of data related to gene or protein function in C. elegans
- The Million Mutation Project - a collection of multiple mutations in virtually every C. elegans gene identified in 2,000 mutagenized strains