GBS Pipeline Tutorial - Step 4 - SNP Calling

SNP Calling

Introduction:

Welcome to the final step in the GBS pipeline! This has potentially been a week-long process up to now, so I hope these tutorials have helped you see it through to the end. If you feel that I missed something at any point, feel free to send one of us an email and I will add the necessary material. Any support or feedback is greatly appreciated and could make the process easier for someone in the future. As I have said in the previous sections, if you have any questions up to this point, now is the time to ask. If you just want someone to consult with about your data, Carolyn or Larissa can help you. The goal of this step is to finally call the SNPs from the reads you have gathered. The reads have been aligned, and now it is time to see where they differ from the reference genome they were aligned to!

Materials:

No additional materials are needed for this step. Hooray!

Editing the GBS.conf file:

The only thing that needs to be done before running the final command is to edit the "GBS.conf" file one last time. There is one field left to fill in, and it can be a tricky one to understand. When running the SNP calling programs we can choose one of three modes: single, multi, or both.

The "single" mode calls SNPs on each sample independently of the other samples, and it creates one output file for every sample you have. The "multi" mode takes ALL of your samples into consideration when calling SNPs and places ALL of the output into one single VCF file. The "both" mode runs all of your files through both modes.

It may seem like there is no real difference between these modes, but depending on the relationships among your samples you may prefer one over the other. If you are running something along the lines of a wild diversity panel, you would most likely prefer the "single" mode, as the samples are analyzed independently of each other. If your samples are related in some way, you may be more interested in the "multi" mode, because it considers all of your samples together during SNP calling. If you are entirely unsure, ask a bioinformatics team member and they can discuss what might be best for your GBS run.

The "both" mode gives you the most flexibility, but it can fill up quite a bit of storage space on the server and takes the longest to run. Also, if you run the step with one mode, you still have the option to re-run it later with a different mode.
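One way to see the practical difference between the modes is to look at the sample columns of a finished VCF file. In the standard VCF layout, the "#CHROM" header line lists one column per sample starting at column 10, so a "multi" mode file should list every sample while a "single" mode file should list just one. The sketch below assumes the pipeline writes plain, uncompressed VCF files in that standard layout; the function name and the example file name are made up for illustration.

```shell
# List the sample names stored in a VCF file, one per line.
# Assumes an uncompressed, tab-delimited VCF with a standard "#CHROM" header line.
vcf_samples() {
    # Columns 1-9 are fixed VCF fields; sample names start at column 10.
    grep -m 1 '^#CHROM' "$1" | cut -f 10- | tr '\t' '\n'
}

# Example (hypothetical file name):
#   vcf_samples my_run_multi.vcf
```

If a "multi" mode file lists fewer samples than you expected, that is worth checking before you move on to any downstream analysis.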

Calling SNPs:

With the "GBS.conf" file filled out appropriately, all we need to do now is run the command! One last time, I will remind you to start a new screen session and check htop for server usage. This process is not as data-intensive as some other steps, but in the "GBS.conf" file I have set it up to run on 20 processors, so please make sure there is enough room on the server before you run the program! When you have the screen session set up and the server is clear, type the following command into your terminal:

/storage/bin/Applications/GBSpipeline/GBS_pipeline.pl call_SNPs
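If you would like a quick number to go along with what htop shows before running the command above, the sketch below estimates how many cores are currently idle by subtracting the 1-minute load average from the total core count. This is only a rough guide and not part of the pipeline; it assumes a Linux server with nproc and /proc/loadavg available, and the function name and the screen session name are made up for illustration.

```shell
# Rough estimate of idle cores: total cores minus the 1-minute load average.
# Assumes Linux (nproc from coreutils and /proc/loadavg).
free_cores() {
    total=$(nproc)
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    # awk handles the fractional load value; clamp at zero on a busy server.
    awk -v t="$total" -v l="$load" \
        'BEGIN { f = int(t - l); if (f < 0) f = 0; print f }'
}

# Typical interactive session (type these by hand, do not run as a script):
#   screen -S gbs_snps     # "gbs_snps" is an arbitrary session name
#   free_cores             # ideally 20 or more, matching the GBS.conf setting
#   /storage/bin/Applications/GBSpipeline/GBS_pipeline.pl call_SNPs
#   (press Ctrl-a then d to detach; "screen -r gbs_snps" reattaches)
```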

The call_SNPs command starts the SNP calling process. As before, you should see some text appear with a progress bar and general information on which sample the program is currently working on and where the processed data will be placed. If this has worked for you, then you have finally made it through the GBS pipeline. Congratulations! It is definitely a process and can be frustrating at times. As I said in the introduction, if you have any concerns about the pipeline or the tutorial, please contact one of us. We will figure out what to do and adjust the pipeline or tutorial to make it smoother for you and any other users. Thank you for using this tutorial, and I sincerely hope it has helped you through this process.
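Once the run has finished, a simple sanity check is to count how many variant records ended up in each VCF file. In the VCF format every header line starts with "#", so counting the remaining lines gives the number of records. The sketch below assumes plain, uncompressed VCF files; the function name and the output directory in the example are placeholders, so use the location the pipeline reported while it was running.

```shell
# Count the variant records in a VCF file (every line that is not a "#" header).
# Assumes an uncompressed VCF; prints 0 if the file contains only headers.
count_variants() {
    grep -v -c '^#' "$1"
}

# Example (hypothetical directory; substitute the path the pipeline printed):
#   for f in /path/to/your/output/*.vcf; do
#       echo "$f: $(count_variants "$f") records"
#   done
```

A file with zero records, or far fewer than its neighbours, is a good reason to get in touch with one of us before going further.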