Helical Docs

Genotype Files

This tutorial walks you through running the helical geno commands. We begin by analysing a geno file and looking at the files the analysis creates. To follow along, download the genotypes file below. The file is generated and does not represent any real population.

Marker File

A marker file (called a geno file in the rest of this tutorial) holds the genotype data for animals. It is arranged as a matrix in which each row represents an animal and each column represents a marker (SNP). The first column is the animal identifier.

Here is the sample geno file we will be using for this tutorial.

SNPs File

A SNPs file (also called a map file) pairs with a marker file and holds information about the markers. The first column is an identifier for the marker, the second column is the chromosome number, and the third column is the position on the chromosome.

Here is the sample map file we will be using for this tutorial.

Notes

From this point on, I will assume you have downloaded the sample geno file and that it is called genotypes.

Analyse

Given a geno file, a good first step is to get a summary of the data. The analyse command does this.

$ helical geno analyse genotypes

You should get this output:

Individuals:                     20
Individuals with no missing loci: 0
Markers:                         1,000
SNPs:                            50
Loci with no missing:            0
Average MAF:                     0.3680
+-----+-------+------+-----------+-------------+----------+------------+
| SNP | COUNT | PROP | GENO MEAN | GENO STDDEV | SNP MEAN | SNP STDDEV |
+-----+-------+------+-----------+-------------+----------+------------+
| AA  | 246   | 0.25 | 12.30     | 3.53        | 4.92     | 1.89       |
| AB  | 244   | 0.24 | 12.20     | 3.64        | 4.88     | 1.59       |
| BB  | 277   | 0.28 | 13.85     | 3.45        | 5.54     | 2.12       |
| NC  | 233   | 0.23 | 11.65     | 3.12        | 4.66     | 1.88       |
+-----+-------+------+-----------+-------------+----------+------------+
  • SNPs: the number of SNPs recorded for each individual.
  • Average MAF: the average minor allele frequency across the SNPs.
  • Markers: the total number of genotype calls in the file (individuals × SNPs, here 20 × 50 = 1,000).

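Below that is a summary for each possible genotype call: AA, AB, BB and NC (no call). COUNT is the number of times the call appears in the file and PROP is its share of all calls. GENO MEAN is the average number of times the call appears per individual (row), and SNP MEAN is the average number of times it appears per SNP (column). For example, AA appears 246 times, which is 246 / 20 = 12.30 per individual and 246 / 50 = 4.92 per SNP. The two STDDEV columns give the corresponding standard deviations.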
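As a quick sanity check, the figures in the summary table can be reproduced from the counts alone. The sketch below is plain Python and is not part of helical. It assumes that PROP is the count divided by the total number of calls, GENO MEAN is the count divided by the number of individuals, and SNP MEAN is the count divided by the number of SNPs. These definitions are inferred from the sample output, and the function name summarise_counts is made up for illustration.

```python
# Counts copied from the sample output of `helical geno analyse genotypes`.
COUNTS = {"AA": 246, "AB": 244, "BB": 277, "NC": 233}
INDIVIDUALS = 20  # rows in the geno file
SNPS = 50         # SNP columns per individual


def summarise_counts(counts, individuals, snps):
    """Return {genotype: (prop, geno_mean, snp_mean)} for each genotype call.

    Assumed definitions (inferred from the sample output, not from helical itself):
      prop      = count / total genotype calls
      geno_mean = count / individuals  (average per row)
      snp_mean  = count / snps         (average per column)
    """
    total = sum(counts.values())
    return {
        genotype: (count / total, count / individuals, count / snps)
        for genotype, count in counts.items()
    }


if __name__ == "__main__":
    summary = summarise_counts(COUNTS, INDIVIDUALS, SNPS)
    for genotype, (prop, geno_mean, snp_mean) in summary.items():
        print(f"{genotype}  {prop:.2f}  {geno_mean:.2f}  {snp_mean:.2f}")
```

The four counts add up to 1,000, which matches the Markers line (20 individuals × 50 SNPs). The standard deviation columns cannot be recomputed this way because they need the per-individual and per-SNP counts, not just the totals.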

If you want more detail, you can specify a directory in which to store additional files. Let's store them in a temporary folder in the current working directory.

$ helical geno analyse genotypes -d temp/geno

If you head over to your temp folder you should see a folder called geno.

$ cd temp/geno

Listing it with ls should show the following files:

genotype_report.txt  outliers.txt  snp_report.txt  summary.txt

summary.txt contains exactly the output shown above. For more information on the SNPs, take a look at snp_report.txt.

$ cat snp_report.txt

This prints the data we want, but the columns are not aligned. This is where the column utility comes in handy.

$ column -t snp_report.txt

Here are the first few lines of the expected output:

SnpID     MAF   #AA  #AB  #BB  #NC  %AA   %AB   %BB   %NC
SNP00001  0.33  4    5    4    7    0.20  0.25  0.20  0.35
SNP00002  0.53  7    7    4    2    0.35  0.35  0.20  0.10
SNP00003  0.28  3    5    7    5    0.15  0.25  0.35  0.25
...

For each SNP, this file shows the number of individuals with each genotype call, the corresponding proportions, and the minor allele frequency.
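The proportion columns follow directly from the counts. The sketch below recomputes them for the three sample rows in plain Python, without calling helical. It also records an observation: in these three rows the MAF column agrees, after rounding, with (2 × #AA + #AB) / (2 × 20), that is, the share of A alleles with no-calls kept in the denominator. This is only a pattern in the sample output, not a documented formula, so treat the hypothetical helper a_allele_share as an assumption.

```python
# Rows copied from the sample snp_report.txt: (SnpID, MAF, #AA, #AB, #BB, #NC)
ROWS = [
    ("SNP00001", 0.33, 4, 5, 4, 7),
    ("SNP00002", 0.53, 7, 7, 4, 2),
    ("SNP00003", 0.28, 3, 5, 7, 5),
]


def proportions(aa, ab, bb, nc):
    """Share of individuals with each genotype call for one SNP."""
    total = aa + ab + bb + nc
    return tuple(n / total for n in (aa, ab, bb, nc))


def a_allele_share(aa, ab, bb, nc):
    """ASSUMPTION: A alleles divided by 2 * all individuals, no-calls included."""
    return (2 * aa + ab) / (2 * (aa + ab + bb + nc))


if __name__ == "__main__":
    for snp_id, reported_maf, aa, ab, bb, nc in ROWS:
        props = " ".join(f"{p:.2f}" for p in proportions(aa, ab, bb, nc))
        print(snp_id, f"{a_allele_share(aa, ab, bb, nc):.3f}", reported_maf, props)
```

Printing the unrounded share next to the reported MAF makes it easy to see whether the same pattern holds for the remaining SNPs in the file.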

The genotype_report.txt file holds the same kind of information, but per individual instead of per SNP. It also has a CallRate column, which is the proportion of that individual's SNPs that have a valid call (not NC).

Here’s a snippet of the file:

ID      CallRate  #AA  #AB  #BB  #NC  %AA   %AB   %BB   %NC
geno19  0.660000  11   9    13   17   0.22  0.18  0.26  0.34
geno15  0.740000  11   6    20   13   0.22  0.12  0.40  0.26
geno6   0.840000  15   17   10   8    0.30  0.34  0.20  0.16
...
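The CallRate values in the snippet can be checked by hand: each row's four counts add up to 50 SNPs, and the call rate is the share of those that are not NC. A minimal Python sketch of that arithmetic follows. It does not call helical, and the function name call_rate is chosen here for illustration.

```python
# Rows copied from the sample genotype_report.txt: (ID, CallRate, #AA, #AB, #BB, #NC)
ROWS = [
    ("geno19", 0.66, 11, 9, 13, 17),
    ("geno15", 0.74, 11, 6, 20, 13),
    ("geno6", 0.84, 15, 17, 10, 8),
]


def call_rate(aa, ab, bb, nc):
    """Fraction of an individual's SNPs with a called genotype (not NC)."""
    total = aa + ab + bb + nc
    return (total - nc) / total


if __name__ == "__main__":
    for ind_id, reported, aa, ab, bb, nc in ROWS:
        print(f"{ind_id}  {call_rate(aa, ab, bb, nc):.6f}  (reported {reported:.6f})")
```

For geno19 this is (50 - 17) / 50 = 0.66, which is also 1 minus the %NC column.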

Finally, there is outliers.txt, which lists the individuals whose count of any genotype call (AA, AB, BB or NC) falls outside 2 standard deviations of the mean.
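The exact statistic helical uses for outliers.txt is not spelled out here, so the following is only a sketch of a generic two-standard-deviation rule. It assumes one count per individual (for example the #NC column of genotype_report.txt) and uses the population standard deviation. The function name, the choice of standard deviation and the counts are all hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(counts, n_sd=2.0):
    """Return the IDs whose count lies more than n_sd standard deviations from the mean.

    counts maps an individual ID to a single count (e.g. its number of NC calls).
    ASSUMPTION: population standard deviation; helical may compute this differently.
    """
    values = list(counts.values())
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # every individual has the same count, so nothing stands out
    return [ind for ind, v in counts.items() if abs(v - centre) > n_sd * spread]


if __name__ == "__main__":
    # Hypothetical NC counts for ten individuals; not taken from the tutorial data.
    hypothetical = [8, 9, 10, 11, 9, 10, 8, 11, 10, 30]
    nc_counts = {f"geno{i}": c for i, c in enumerate(hypothetical, start=1)}
    print(find_outliers(nc_counts))
```

With these made-up counts only the last individual, with 30 no-calls, is flagged. Running the same check once per genotype column would mirror the "any genotype call" wording above.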