Workshop 3. Advanced CLI Techniques Workshop Session¶
In this workshop we will extract annotation information from the human GENCODE v26 gene annotation exclusively using command pipelines, without writing any files to disk. Specifically, we have the following goals:
- Identify the unique annotated gene types in the annotation (i.e.
protein_coding
,lincRNA
,miRNA
, etc) - Identify the unique gene names of all ribosomal RNAs
- Count the number of unique genes annotated as ribosomal RNAs
- Use the ensembl gene IDs to download the genomic sequences for all ribosomal RNA genes to a local fasta file
To make writing these commands easier, we will write them into a file and then execute that file using bash.
Task - Create a script file
Create a file using one of the command line text editors, named something
like gencode_script.sh
. You may name the file whatever you wish, but
files that contain shell commands typically have a .sh
extension. The
file will be empty to begin with.
HINT: If you created a script called gencode_script.sh
that has shell
commands in it, you may execute the commands in that file by running on the
command line:
bash gencode_script.sh
Goal 1: Unique annotated gene types¶
The GENCODE gene annotation v26 contains the genomic coordinates of all known and well established genes in the human genome. The annotation is encoded in a gene transfer format (GTF) file.
Task - Define a shell variable pointing to the GENCODE v26 GTF URL
Look for the link to the first GTF file on the v26 web page (where the
Regions column is CHR). Once you have found the link, define a shell variable
named GENCODE
in your script file and set it equal to the full URL of the
GTF file.
The curl
program, similar to wget
, accepts a command line argument that
is a URL and prints the contents of the file at that URL to the terminal. Notice
that the URL you identified above points to a gzipped file. Therefore, if you
simply press enter after writing a curl
command invocation with that URL,
the compressed file contents will be printed to the screen. This probably isn’t
what you want. However, we can instead use the |
character to send the
gzipped output from curl
to zcat
, which will decompress the input on the
fly.
Task - Fetch the annotation using curl
and pipe to zcat
Construct a command using curl
to fetch the GENCODE URL defined above by
writing it onto its own line in your script file. Pipe the output of curl
to zcat
to decompress it on the fly. Print out the top ten lines of the
uncompressed output using another pipe to head
, to convince yourself that
you have done this correctly.
HINT: Use shell variable expansion when constructing the curl
command
HINT: curl
may print out some status information to stderr
as it
runs. You may suppress this output using either -s
option to your call
or by redirecting stderr
to /dev/null
From looking at the top records in the uncompressed GTF file, you will notice
in the last column a pattern that looks like gene_type "<type>";
. The value
of <type>
indicates the biotype of this annotated gene. We would like to
know what the gene type string is for ribosomal RNA. To do this, we can use
grep
with the -o
command line argument.
Recall that grep
can be used to look for specific patterns in text, e.g.
grep gene_type annot.gtf
will print out the lines in annot.gtf
that
contain the text gene_type
. The first argument to grep
is interpreted
as a regular expression, which is a language that expresses patterns in text.
Task - grep
out the gene_type
from the zcat
GTF
Look at the man
page for grep
to identify what the -o
argument
does. Also read the regular expression information if you are unfamiliar
with regular expressions. Use grep
and write a regular expression to
print out only the gene_type "<type>";
part of the uncompressed GTF
output. Add your grep
command to the end of the command you previously
wrote into your script.
HINT: What types of characters do you expect to be between the “s in
<type>
?
HINT: Try looking at inverted character classes in the regex documentation.
HINT: You probably don’t want to print out all of the grep
output to the
screen all at once. How have you looked at only a part of the results before?
Now that we have captured the gene_type "<type>";
from every line where it
is found, we would like to know what the unique <type>
values are. We can
use sort
and uniq
to do this.
Task - Identify the unique gene types
Read the man
page for uniq
to understand how it works. Pipe the
output of your grep
call to the appropriate commands to print out only
the unique values to the terminal.
HINT: This command might take a minute or two to complete for the whole file. How might you restrict the input to your new commands to shorten this run time for debugging purposes?
Once you have printed out the unique gene types, look for an entry that looks like it corresponds to ribosomal RNA. Make note of this value, as we will use it in the next steps.
Goal 2. Identify the unique gene names of all ribosomal RNAs¶
Now that you have identified the gene type for ribosomal RNAs, we will use this
information to restrict the section of the annotation to only those of this gene
type. Start writing a new command on a new line of your script. We will begin
with the same curl
and zcat
command as above.
Task - Restrict annotations to only those marked as ribosomal RNA
Use grep
to filter the lines of the uncompressed GTF file that are
annotated as ribosomal RNAs.
HINT: You may prevent your previous command from running every time you run
the script by putting a #
at the beginning of the line with the command.
The #
character indicates a comment in bash.
When you inspect the output of these gene annotations, notice that some of the
lines have a field gene_name "<gene name>";
toward the end of the line. This
is the (somewhat) human-readable gene name for this gene. We are interested in
identifying the unique gene names for the ribosomal RNA genes.
Task - Extract just the gene names out of each annotated ribosomal RNA
Using the grep
command as we have earlier, extract out just the part of
each line that contains the gene_name
attribute. Use your strategy from
earlier to identify only the unique gene names.
Task - Identify the number of unique gene names
The output from the previous task can be used with wc
to identify the
number of unique gene names. How many unique gene names are there?
Goal 3. Count the number of unique genes annotated as ribosomal RNAs¶
In the previous task, you found the unique human readable gene names in the
annotation. However, human readable gene names are often unstable and inaccurate.
A more reliable way of counting the number of annotated genes is by examining
the records that are marked as having the feature type gene
. The feature
type is the third column of the GTF file, and contains what kind of annotation
the line represents, e.g. gene, exon, UTR, etc. By definition, each gene only has
a single gene
record. Counting the annotations of ribosomal RNAs that are
marked as gene
feature types, we can get a more accurate number of genes.
Task - Filter ribosomal RNA records of feature type gene
Create a new line in your script to write this new command. Using the
annotation records filtered by ribosomal RNA as above, identify only those
records that have gene
as the feature type.
HINT: There are several ways to do this. One way is to use the cut
command with grep
to ensure you are matching on the correct strings. You
may also consider looking for ways to make a regular expression match the
beginning and end of a word. Or you may consider trying to use the whitespace
delimiter (tab character) to match only a value in a single column. There are
probably other ways to go about this too.
Task - Count the number of gene feature type records
Using the output from above and wc
, count the number of gene feature type
records.
How many gene records are there for ribosomal RNAs?
How does this compare to the number of unique gene names?
Goal 4. Download the genomic ribosomal RNA gene sequences to a fasta file¶
Ensembl is a genome browser and database of genomic information, including genome sequence. Ensembl also has its own gene identifier system that is often more stable than human readable gene symbols. This is very convenient when processing genome information programmatically, and GENCODE includes Ensembl gene identifiers as its primary identifier type. Human Ensembl gene identifiers are of the form ENSGXXXXXXXXXXX, where each X is a number 0-9, and there are (presently) eleven digits.
Task - Extract only the Ensembl gene id from each ribosomal RNA gene
Create a new line in your script to begin writing a new command. Starting
with the commands above that identify unique gene records, use grep
to
extract only the portion of the record that corresponds to the Ensembl gene
ID.
HINT: All human Ensembl gene IDs start with ENSG and have the same number of characters in them.
Check that the number of unique Ensembl IDs is the same as the total number of Ensembl IDs (i.e. that there are no duplicate gene IDs). Are the they the same?
Ensembl has an RESTful API that allows programmatic access to nearly all of
the information in the Ensembl database. One of the endpoints allows automatic
extraction of FASTA-formatted sequences for a given Ensembl ID. We can use
curl
once again to download the genomic FASTA sequence for an Ensembl ID as
follows:
$ curl https://rest.ensembl.org/sequence/id/ENSG00000199240?type=genomic -H 'Content-type:text/x-fasta'
>ENSG00000199240 chromosome:GRCh38:1:43196417:43196536:1
GTCTACAGCCATAACACCGTGAATGCACCTGATCTTGTCTGATCTCAGAAGCTAAGCAGG
GTCAGGCCTGGTTAATACTTGGATGGGAGATACTAGCGTAGGATAGAGTGGATGCAGATA
This example, however, only downloads the sequence for a single identifier. We would like to download sequences for every identifier we obtained in the previous step. To do this, we can use either xargs or fim, as described in the videos.
Task - Download all of the sequences for ribosomal RNA genes
Using the Ensembl IDs from the previous step, write either an xargs
or
fim
command that downloads the genomic sequences for each ID. Use curl
and follow the URL pattern above with the appropriate substitution. Write the
FASTA formatted sequences to a file, named to your liking.
Take a look at the FASTA file using one of head
, cat
, less
,
vim
, emacs
, or ok even nano
.
Pause and reflect on how powerful you have become.