Workshop 3. Advanced CLI Techniques Workshop Session

In this workshop we will extract annotation information from the human GENCODE v26 gene annotation exclusively using command pipelines, without writing any files to disk. Specifically, we have the following goals:

  1. Identify the unique annotated gene types in the annotation (i.e. protein_coding, lincRNA, miRNA, etc)
  2. Identify the unique gene names of all ribosomal RNAs
  3. Count the number of unique genes annotated as ribosomal RNAs
  4. Use the ensembl gene IDs to download the genomic sequences for all ribosomal RNA genes to a local fasta file

To make writing these commands easier, we will write them into a file and then execute that file using bash.

Task - Create a script file

Create a file using one of the command line text editors, named something like gencode_script.sh. You may name the file whatever you wish, but files that contain shell commands typically have a .sh extension. The file will be empty to begin with.

HINT: If you created a script called gencode_script.sh that has shell commands in it, you may execute the commands in that file by running on the command line:

bash gencode_script.sh

Goal 1: Unique annotated gene types

The GENCODE gene annotation v26 contains the genomic coordinates of all known and well established genes in the human genome. The annotation is encoded in a gene transfer format (GTF) file.

Task - Define a shell variable pointing to the GENCODE v26 GTF URL

Look for the link to the first GTF file on the v26 web page (where the Regions column is CHR). Once you have found the link, define a shell variable named GENCODE in your script file and set it equal to the full URL of the GTF file.

The curl program, similar to wget, accepts a command line argument that is a URL and prints the contents of the file at that URL to the terminal. Notice that the URL you identified above points to a gzipped file. Therefore, if you simply press enter after writing a curl command invocation with that URL, the compressed file contents will be printed to the screen. This probably isn’t what you want. However, we can instead use the | character to send the gzipped output from curl to zcat, which will decompress the input on the fly.

Task - Fetch the annotation using curl and pipe to zcat

Construct a command using curl to fetch the GENCODE URL defined above by writing it onto its own line in your script file. Pipe the output of curl to zcat to decompress it on the fly. Print out the top ten lines of the uncompressed output using another pipe to head, to convince yourself that you have done this correctly.

HINT: Use shell variable expansion when constructing the curl command

HINT: curl may print out some status information to stderr as it runs. You may suppress this output using either -s option to your call or by redirecting stderr to /dev/null

From looking at the top records in the uncompressed GTF file, you will notice in the last column a pattern that looks like gene_type "<type>";. The value of <type> indicates the biotype of this annotated gene. We would like to know what the gene type string is for ribosomal RNA. To do this, we can use grep with the -o command line argument.

Recall that grep can be used to look for specific patterns in text, e.g. grep gene_type annot.gtf will print out the lines in annot.gtf that contain the text gene_type. The first argument to grep is interpreted as a regular expression, which is a language that expresses patterns in text.

Task - grep out the gene_type from the zcat GTF

Look at the man page for grep to identify what the -o argument does. Also read the regular expression information if you are unfamiliar with regular expressions. Use grep and write a regular expression to print out only the gene_type "<type>"; part of the uncompressed GTF output. Add your grep command to the end of the command you previously wrote into your script.

HINT: What types of characters do you expect to be between the “s in <type>?

HINT: Try looking at inverted character classes in the regex documentation.

HINT: You probably don’t want to print out all of the grep output to the screen all at once. How have you looked at only a part of the results before?

Now that we have captured the gene_type "<type>"; from every line where it is found, we would like to know what the unique <type> values are. We can use sort and uniq to do this.

Task - Identify the unique gene types

Read the man page for uniq to understand how it works. Pipe the output of your grep call to the appropriate commands to print out only the unique values to the terminal.

HINT: This command might take a minute or two to complete for the whole file. How might you restrict the input to your new commands to shorten this run time for debugging purposes?

Once you have printed out the unique gene types, look for an entry that looks like it corresponds to ribosomal RNA. Make note of this value, as we will use it in the next steps.

Goal 2. Identify the unique gene names of all ribosomal RNAs

Now that you have identified the gene type for ribosomal RNAs, we will use this information to restrict the section of the annotation to only those of this gene type. Start writing a new command on a new line of your script. We will begin with the same curl and zcat command as above.

Task - Restrict annotations to only those marked as ribosomal RNA

Use grep to filter the lines of the uncompressed GTF file that are annotated as ribosomal RNAs.

HINT: You may prevent your previous command from running every time you run the script by putting a # at the beginning of the line with the command. The # character indicates a comment in bash.

When you inspect the output of these gene annotations, notice that some of the lines have a field gene_name "<gene name>"; toward the end of the line. This is the (somewhat) human-readable gene name for this gene. We are interested in identifying the unique gene names for the ribosomal RNA genes.

Task - Extract just the gene names out of each annotated ribosomal RNA

Using the grep command as we have earlier, extract out just the part of each line that contains the gene_name attribute. Use your strategy from earlier to identify only the unique gene names.

Task - Identify the number of unique gene names

The output from the previous task can be used with wc to identify the number of unique gene names. How many unique gene names are there?

Goal 3. Count the number of unique genes annotated as ribosomal RNAs

In the previous task, you found the unique human readable gene names in the annotation. However, human readable gene names are often unstable and inaccurate. A more reliable way of counting the number of annotated genes is by examining the records that are marked as having the feature type gene. The feature type is the third column of the GTF file, and contains what kind of annotation the line represents, e.g. gene, exon, UTR, etc. By definition, each gene only has a single gene record. Counting the annotations of ribosomal RNAs that are marked as gene feature types, we can get a more accurate number of genes.

Task - Filter ribosomal RNA records of feature type gene

Create a new line in your script to write this new command. Using the annotation records filtered by ribosomal RNA as above, identify only those records that have gene as the feature type.

HINT: There are several ways to do this. One way is to use the cut command with grep to ensure you are matching on the correct strings. You may also consider looking for ways to make a regular expression match the beginning and end of a word. Or you may consider trying to use the whitespace delimiter (tab character) to match only a value in a single column. There are probably other ways to go about this too.

Task - Count the number of gene feature type records

Using the output from above and wc, count the number of gene feature type records.

How many gene records are there for ribosomal RNAs?

How does this compare to the number of unique gene names?

Goal 4. Download the genomic ribosomal RNA gene sequences to a fasta file

Ensembl is a genome browser and database of genomic information, including genome sequence. Ensembl also has its own gene identifier system that is often more stable than human readable gene symbols. This is very convenient when processing genome information programmatically, and GENCODE includes Ensembl gene identifiers as its primary identifier type. Human Ensembl gene identifiers are of the form ENSGXXXXXXXXXXX, where each X is a number 0-9, and there are (presently) eleven digits.

Task - Extract only the Ensembl gene id from each ribosomal RNA gene

Create a new line in your script to begin writing a new command. Starting with the commands above that identify unique gene records, use grep to extract only the portion of the record that corresponds to the Ensembl gene ID.

HINT: All human Ensembl gene IDs start with ENSG and have the same number of characters in them.

Check that the number of unique Ensembl IDs is the same as the total number of Ensembl IDs (i.e. that there are no duplicate gene IDs). Are the they the same?

Ensembl has an RESTful API that allows programmatic access to nearly all of the information in the Ensembl database. One of the endpoints allows automatic extraction of FASTA-formatted sequences for a given Ensembl ID. We can use curl once again to download the genomic FASTA sequence for an Ensembl ID as follows:

$ curl https://rest.ensembl.org/sequence/id/ENSG00000199240?type=genomic -H 'Content-type:text/x-fasta'
>ENSG00000199240 chromosome:GRCh38:1:43196417:43196536:1
GTCTACAGCCATAACACCGTGAATGCACCTGATCTTGTCTGATCTCAGAAGCTAAGCAGG
GTCAGGCCTGGTTAATACTTGGATGGGAGATACTAGCGTAGGATAGAGTGGATGCAGATA

This example, however, only downloads the sequence for a single identifier. We would like to download sequences for every identifier we obtained in the previous step. To do this, we can use either xargs or fim, as described in the videos.

Task - Download all of the sequences for ribosomal RNA genes

Using the Ensembl IDs from the previous step, write either an xargs or fim command that downloads the genomic sequences for each ID. Use curl and follow the URL pattern above with the appropriate substitution. Write the FASTA formatted sequences to a file, named to your liking.

Take a look at the FASTA file using one of head, cat, less, vim, emacs, or ok even nano.

Pause and reflect on how powerful you have become.