Tuesday, January 18, 2011

Perl Basics

What is Perl?


      Perl stands for Practical Extraction and Report Language.

      Perl is an interpreted programming language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information.

      The string-based nature of DNA and protein sequence data makes Perl an obvious choice for many of the simpler problems in computational biology.

How do I install Perl on my Computer?

Perl is an application, just like any other Windows application.  It can be downloaded for free from http://www.perl.com/, the main website for Perl, which also contains many useful links.

Downloading Perl For Windows

If you are using a Windows-based computer, go to the ActiveState website (http://www.activestate.com/) and download the latest version of ActivePerl, which is freely available.  As of this writing, the latest version is ActivePerl 5.8.0 build 804.  Make sure to download the Windows MSI package.  The ActiveState website has installation instructions, but you should be able to simply download the appropriate package and run it; the installation then proceeds automatically.

Perl for Other Operating Systems
 
If you need the Mac version of Perl, you can download it from http://www.macperl.com/.

If you are using a Linux or Unix based operating system, chances are good that Perl is already installed.  To check, type:

perl -v

If it is installed, you will get back a message giving you information concerning the version you are using; otherwise, it will report back:

command not found


[Figure: ActiveState download page]


Adding Perl to your path

Installing ActivePerl should automatically add Perl to your path.  You can check this by opening a command window (go to Start->Run and type in “command”) and typing the word perl.  If you get a message saying that the command could not be found, then Perl is not in your path.  Otherwise, Perl will start and wait for input from you (you can type ctrl-c to exit).

If Perl is not in your path, you can add it by editing the file c:\autoexec.bat.  Adding Perl to your path lets you run it by typing only the command perl instead of its full location, such as c:\perl\bin\perl.  This takes two steps.  First, make a backup copy of the file c:\autoexec.bat.  This can be done in the DOS window by typing

copy c:\autoexec.bat c:\autoexec.bak

Next, edit the file c:\autoexec.bat.  This can be done either in
Notepad, which can be found under Start->Programs->Accessories, or by typing in the
DOS window:

edit c:\autoexec.bat

At the end of this file, add the following line, substituting the location where Perl is installed:

PATH=%PATH%;PATH TO PERL GOES HERE;

If you installed ActivePerl using the default locations, then Perl is located in c:\Perl\Bin and the line becomes:

PATH=%PATH%;c:\perl\bin;

Then save autoexec.bat.  You may need to reboot your computer for these
changes to take effect.  Note that you only need to do this step once.


Creating Perl Programs

Perl programs can be created using any basic text editor, such as Notepad.  Perl programs typically end with a “.pl” extension.  If you save Perl programs in Notepad, make sure to set the file type to all files.  Otherwise a .txt extension will automatically be added to your Perl programs!

What is a Perl Program?


    A program consists of a text file containing a series of Perl statements.
    Perl statements are separated by semicolons.
    Leading spaces on a line are ignored.
    Outside of quoted strings, multiple spaces, tabs, and blank lines are ignored.
    Anything from a # to the end of the line is ignored (it is a comment).
    Perl is case sensitive.
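
As a small illustration of these rules, here is a minimal sketch that is not part of the original lesson.  The variable names are made up for the example, and the use strict, use warnings, and my lines are additions that the programs in this tutorial leave out.

```perl
use strict;
use warnings;

# Anything after a '#' on a line is a comment.
my $dna = "ATG"; my $DNA = "TAA";   # two statements on one line

my $both =    $dna
            . $DNA;                 # one statement spread across two lines

# $dna and $DNA are different variables because Perl is case sensitive
print "$dna $DNA $both\n";          # prints: ATG TAA ATGTAA
```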

Your First Perl Program


We will be starting with a simple program to get you started with learning what a Perl program looks like, and how it is created and run.

How do I edit or create a Perl Program?

For now, you can use Notepad, edit, or any other simple text editor to create your Perl program.  The editor you choose must save your files as plain text.  Perl files should always be saved with a .pl extension.

How do I run a Perl Program?

Perl programs are run by typing perl followed by the program name and any command line options.  You should run Perl programs from the command window, which is opened with Start->Run and typing “command”.  Then change (using the command “cd”) to the directory where the Perl program is located.  For instance, if you saved the program “hello.pl” in the directory c:\PerlPrograms\ you would type:

cd c:\PerlPrograms

You can then run “hello.pl” by typing:

perl hello.pl

Hello world


Type in the following program using an editor as described above.

# This is my first Perl program!

print ("Hello World!\n");
print ("This is my first Perl program.\n\n");


Now that you have it typed in, we need to save and run the program.  Save the program as hello.pl  and then run it by typing perl hello.pl

Understanding Hello World


print() is a function that displays text on the screen.
\n prints a newline; i.e. output continues at the start of the next line.
# signals the beginning of a comment; everything after it on that line is ignored.
; separates Perl statements from one another.  Statements do not have to occur on separate lines; that is done only to make the program easier for humans to read.

Updating Hello World


A program is usually more useful if it can be run with different data each time.  We will now update the hello world program to use command line arguments, which can be used to set different options for a program.  In this example, the user enters their name, which is then printed to the screen.







Edit the hello.pl file so it now looks like the following:

use Getopt::Long;

# usage: perl helloMe.pl -name [NAME]

&GetOptions("name=s" => \$Name);
# The above line lets the user enter their name

print ("Hello $Name \n");
print ("I now have written a perl program using command \n");
print ("line arguments and variables \n\n");

Once you have finished, save this file as helloMe.pl and then run it by typing:

perl helloMe.pl -name yourname


Understanding the Updated Version


use Getopt::Long;  For now, we will just say that this line is needed whenever we want to retrieve command line arguments.  It tells the Perl interpreter that we are using a function (GetOptions) that is defined in another location.

&GetOptions() is a function that retrieves the command line options.  In this example, it retrieves the value directly after -name when the program is run and stores it in a variable called $Name.  This value can be used at any later point in the program.  We will talk about variables more in the next Perl session.
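
If the user leaves out -name, the variable stays undefined and the greeting prints an empty name.  The following is a minimal sketch of one way to handle that, not the tutorial's own program: it assumes a made-up default of "stranger" and stops with a usage message if the command line cannot be parsed.  The use strict, use warnings, and my lines are additions.

```perl
use strict;
use warnings;
use Getopt::Long;

my $name = "stranger";      # hypothetical default, used when -name is absent

# GetOptions returns false if the command line could not be parsed
GetOptions("name=s" => \$name)
    or die "usage: perl helloMe.pl -name NAME\n";

print "Hello $name\n";
```

Running it as perl helloMe.pl -name Eric greets Eric; running it with no options greets the default name.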

Perl Data Types

Variables


      A variable is a container that holds one or more values that can change throughout a program.

      There are three types of variables in Perl:
      Scalar
      Array
      Associative array (hash/lookup table)

Scalar


      A scalar variable holds a single value, which can be a number or a character string.
      Scalar variables have a dollar sign ($) prefix.
      Examples of scalar variables:

$EcoRI = 'GAATTC';
$len = 6;

Strings

Strings are sequences of characters (like hello).

Single-Quoted Strings


      Text placed between a pair of single quotes is interpreted literally.
      To get a single quote into a single-quoted string, precede it by a backslash (\).
      To get a backslash into a single-quoted string, precede the backslash by a backslash.
      Examples of strings:

'hello'         # hello
'can\'t'        # can't
'http:\\\\www'  # http:\\www

 

Double-Quoted Strings


      A double-quoted string “interpolates” variables, which means that variable names within the string are replaced by their current values.
      Examples

$x = 1;
print '$x';  # will print out $x
print "$x";  # will print out 1

      Several escape sequences can be used inside double-quoted strings:

\n          Newline
\t          Tab
\\          Backslash
\"          Double quote
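
To see the difference between the two kinds of quotes side by side, here is a short sketch (the variable names and text are made up for illustration; use strict, use warnings, and my are additions to the style used in this tutorial):

```perl
use strict;
use warnings;

my $enzyme = "EcoRI";
my $site   = "GAATTC";

my $literal      = '$enzyme cuts at $site\n';         # single quotes: nothing is replaced
my $interpolated = "$enzyme cuts at $site\n";         # double quotes: variables and \n are expanded
my $escaped      = "Tab:\tQuote:\"\tBackslash:\\\n";  # escape sequences in double quotes

print $literal, "\n";    # prints: $enzyme cuts at $site\n
print $interpolated;     # prints: EcoRI cuts at GAATTC
print $escaped;
```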

Operators on Scalar Variables

      An operator generates a new value from one or more values.

      Arithmetic Operators

            +          Addition
            -          Subtraction
            *          Multiplication
            **         Exponentiation
            /          Division
            %          Modulus

      Assignment operator (=)

     $a = 5;
     $b = "Hello";
     $c = $a + 2;

      Binary Assignment Operators (+= , -=)

     $a += 2;  # Equivalent to $a = $a + 2;
     $a -= 2;  # Equivalent to $a = $a - 2;

      Autoincrement and Autodecrement Operators (++, --)

     $a++;     # Equivalent to $a = $a + 1;
     $a--;     # Equivalent to $a = $a - 1;

      String Concatenation Operator (.)

     $exon1 = "CGGCTGCCACGAGTGCAGGCCAG";
     $exon2 = "GCAACGGGATGGTGAGCCGCT";
     $mRNA = $exon1 . $exon2;
     # $mRNA is CGGCTGCCACGAGTGCAGGCCAGGCAACGGGATGGTGAGCCGCT

      Numeric and String Comparison Operators
           
            Comparison                             Numeric                      String
            Equal                                          ==                                eq
            Not Equal                                   !=                                 ne
            Less Than                                   <                                  lt
            Greater Than                              >                                  gt
            Less than or equal to                  <=                                le
            Greater than or equal to             >=                                ge
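
Choosing the wrong column of this table is a common source of bugs, because the same two values can compare differently as numbers and as strings.  A small sketch with made-up values (it uses an if statement, which is explained later in this tutorial; use strict, use warnings, and my are additions):

```perl
use strict;
use warnings;

my ($x, $y) = (10, 9);

my $numeric_less = 0;
my $string_less  = 0;

if ($x <  $y) { $numeric_less = 1; }   # as numbers, 10 is not less than 9
if ($x lt $y) { $string_less  = 1; }   # as strings, "10" sorts before "9"
                                       # because "1" comes before "9"

print "numeric: $numeric_less, string: $string_less\n";   # numeric: 0, string: 1
```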
           
      chop() and chomp() string operators

      The chop() operator removes the last character from the string.

     $x = "world";
     chop($x);     # $x is now "worl"

      The chomp() operator removes a trailing newline character from the end of the string, if there is one.

     $a = "Hello World\n";
     chomp($a);    # $a is now "Hello World"

Arrays


      An array is an ordered list of data.
      An array variable name begins with an at sign (@).

Array Assignment


     @array1 = (1, 2, 3, 4, 'five');   # @array1 has 5 elements
     @array2 = @array1;                # @array2 is a copy of @array1
     @array3 = (@array1, 'six');       # @array3 is (1, 2, 3, 4, 'five', 'six')

Element Access

An array element can be accessed by a numeric index.
Array elements are numbered using sequential integers, beginning at zero.
A single element is a scalar, so it is written with a dollar sign ($) rather than an at sign (@).

     @array1 = (1, 2, 3);
     $a = $array1[2];   # $a is 3
     $array1[1] = 6;    # @array1 is (1, 6, 3)
     $array1[0]++;      # @array1 is (2, 6, 3)
              

push() and pop() operators

push() will insert one or more elements at the end of the array.  pop() will remove and return the last element of the array.

     @array1 = (1, 3, 5);
     push(@array1, 2, 4, 6);    # @array1 is (1, 3, 5, 2, 4, 6)
     $last = pop(@array1);      # @array1 is (1, 3, 5, 2, 4)
                                # and $last is 6

shift() and unshift() operators

shift() removes and returns the first element of the array.  unshift() inserts one or more elements at the beginning of the array.

     unshift(@array1, 7, 8);    # @array1 is (7, 8, 1, 3, 5, 2, 4)
     $first = shift(@array1);   # @array1 is (8, 1, 3, 5, 2, 4)
                                # and $first is 7

reverse() returns the elements of the array in reverse order; the original array is unchanged.

     @array1 = (1, 2, 3);
     @array2 = reverse(@array1);   # @array2 is (3, 2, 1)
                                   # @array1 is still (1, 2, 3)

join() glues the elements of a list together with a glue string between each element.

            @array1 = ('Hello', 'to', 'me.');
      $sentence = join(' ', @array1);     # $sentence is 'Hello to me.'


Retrieving array length

Using an array in a context where a scalar value is expected returns the length of the array.

            @array1 = (1, 2, 3, 4);
     $a = @array1;           #$a is 4
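
The array operators above can be combined.  The sketch below uses a made-up list of codons to show push, pop, shift, join, and the array length working together (use strict, use warnings, and my are additions to the style used in this tutorial):

```perl
use strict;
use warnings;

my @codons = ('ATG', 'GCC');
push(@codons, 'TAA');               # @codons is ('ATG', 'GCC', 'TAA')

my $count = @codons;                # scalar context gives the length: 3

my $stop  = pop(@codons);           # $stop is 'TAA'; @codons is ('ATG', 'GCC')
my $start = shift(@codons);         # $start is 'ATG'; @codons is ('GCC')

# join with an empty glue string puts the pieces back together
my $orf = join('', $start, @codons, $stop);

print "$count codons, ORF = $orf\n";   # prints: 3 codons, ORF = ATGGCCTAA
```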

Associative Arrays


    Also called hashes or look up tables.

    The associative array variable begins with a percent sign (%).

    An associative array is similar to an array.  The difference is that an array uses integers as index values but the associative array uses arbitrary scalars called keys.  The keys are used to retrieve the corresponding values from the associative array.

     %person = ('name', 'John Doe',
               'age',  50,
              'salary',  100000);

    Associative arrays are not ordered.  To find a specific value, look it up by its key.  Curly brackets {} are used instead of square brackets [] for associative arrays.

     $a = $person{'age'};         # $a is 50
     $person{'city'} = 'Chicago'; # key 'city' and value
                                  # 'Chicago' are added

Operators for Associative Arrays

    keys() returns a list of all the current keys in the associative array.

     @list = keys(%person);    # @list gets ('name', 'age',
                               # 'salary', 'city'), but not
                               # necessarily in that order

    values() returns a list of all the current values in the associative array.

     @list1 = values(%person); # @list1 gets ('John Doe', 50,
                               # 100000, 'Chicago'), but not
                               # necessarily in that order

    each() returns a key-value pair as a two element list.

     while(($a, $b) = each(%person)) {
          print "$a \t $b \n";
     }

The following will be printed, though not necessarily in this order:

     name      John Doe
     age       50
     salary    100000
     city      Chicago

delete removes both key and value from the associative array.  Note the curly brackets around the key.

     delete $person{'city'};
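
A typical use of an associative array in sequence analysis is counting.  The following is a minimal sketch, not part of the original lesson, that counts each base in a made-up sequence.  It uses a for loop and substr(), both explained later in this tutorial, and sort(), which is not covered here, to print the keys in a predictable order.

```perl
use strict;
use warnings;

my $seq = "GAATTC";
my %count;                                  # starts out empty

for (my $i = 0; $i < length($seq); $i++) {
    my $base = substr($seq, $i, 1);
    $count{$base}++;                        # a missing key is created on first use
}

foreach my $base (sort(keys(%count))) {     # sorted keys: A, C, G, T
    print "$base: $count{$base}\n";
}
```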



Perl Control Structures


Statement Blocks


    A sequence of statements can be grouped together by enclosing them in curly braces to form a statement block.  The statements in the block are executed in the order in which they appear.

    Structure

{
     first statement;
     second statement;
     …
     last statement;
}

if/elsif/else statement


    Structure

            if(expression) {
                        statement 1;
                        statement 2;
                        …
            }
            else {
                        statement a;
                        statement b;
            }

    If expression evaluates to true, the first statement block will be executed and the second will be skipped.  If expression evaluates to false, only the second block will be executed.  The else and its associated statement block are optional.

    If expression is undefined, has the value zero (0 or the string "0"), or is the empty string "", then it evaluates to false.  Anything else is true.

    If there are more than two possible choices, one or more elsif branches can be added.
           


    Structure

if(expression 1) {
          …
     }
     elsif(expression 2) {
          …
     }
     elsif(expression 3) {
          …
     }
     else {
          …
     }

    Example

            $day = 1;
     if($day == 1) {
           print ("Sunday\n");
     }
     elsif($day == 2) {
           print ("Monday\n");
     }
     elsif($day == 3) {
          print ("Tuesday\n");
     }
     else {
           print("Day must be Sunday, Monday, or Tuesday\n");
     }
     # Sunday will be displayed
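
Which values count as true can be surprising, so here is a short sketch that tests a handful of made-up values in an if statement.  It uses arrays and a for loop; use strict, use warnings, and my are additions to the style used in this tutorial.

```perl
use strict;
use warnings;

my @labels = ('0', '"0"', '""', '"0.0"', '"00"', '" "', '1');
my @values = ( 0,   "0",   "",   "0.0",   "00",   " ",   1 );
my @truth;

for (my $i = 0; $i < @values; $i++) {
    if ($values[$i]) { push(@truth, "true");  }
    else             { push(@truth, "false"); }
    print "$labels[$i] is $truth[$i]\n";
}
# Only 0, "0" and "" are false here; "0.0", "00" and " " are true
# because they are non-empty strings other than "0".
```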
           

while loop


    Structure

            while (expression) {
          statement 1;
          statement 2;
          …
     }

    The statement block (loop) iterates as long as the expression evaluates to true.  When the expression becomes false, program flow passes to the line of code immediately following the loop.


    Example

            $x = 5;
     $y = 3;
     while($x > $y) {
          print("inside while: $x * $y = ", $x * $y, "\n");
          $y++;
     }
     print("outside while: x = $x; y = $y\n");

       This will display:

            inside while: 5 * 3 = 15
     inside while: 5 * 4 = 20
     outside while: x = 5; y = 5


for loop


    Structure:

            for (initial_exp; test_exp; increment_exp) {
          statement 1;
          statement 2;
          …
     }

    Example

     for($x = 0; $x < 5; $x++) {
          print("$x * $x = ", $x * $x, "\n");
     }

       This will display:

     0 * 0 = 0
     1 * 1 = 1
     2 * 2 = 4
     3 * 3 = 9
     4 * 4 = 16



foreach statement


    Structure

            foreach $s (@list) {
          statement 1;
          statement 2;
     }

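    foreach takes a list of values, assigns them one at a time to a scalar variable, and executes a statement block for each.  Inside the block the scalar variable is an alias for the current element, so changing it changes the array itself.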

    Example:

            @array = (1, 5, 7, 9, 12, 20, 3, 6, 11);  
     print ("@array\n");

     foreach $a (@array) {
          $a += 2;
     }
    
     print("@array\n");

      The program will display:

            1 5 7 9 12 20 3 6 11
     3 7 9 11 14 22 5 8 13

last statement


    The last statement causes the loop to terminate.

    Example:

     @array = ("A".."Z");
     for($index = 0; $index < @array; $index++) {
          if($array[$index] eq "G") {
              last;
          }
     }
     print("$array[$index]\n");

     The program displays:
            G

next statement


    The next statement causes execution to skip to the next iteration of the loop.

    Example:

     @array = (0..9);
     for($index = 0; $index < @array; $index++) {
          if(($array[$index] % 3) != 0)  {
              next;
          }
          print ("$array[$index] is divisible by 3\n");
     }

   This program will display
           
            0 is divisible by 3
     3 is divisible by 3
     6 is divisible by 3
     9 is divisible by 3

subroutines


    If you have code that will be used in several places, you can put it in a subroutine so that it only has to be written once.  Subroutine definitions can be placed anywhere in the program.

            sub subroutine_name {
          …
     }

    Arguments:  Values called arguments can be passed to a subroutine.  If the subroutine invocation is followed by a list inside parentheses, the list is automatically assigned to a special variable @_.

    Return Values: The return value of a subroutine is the value of the return statement or of the last expression evaluated in the subroutine.


    Example:

            $area = areaOfCircle(5);
     print ("$area\n");
    
     sub areaOfCircle{
          $radius = $_[0];
          return(3.1415 * ($radius ** 2));
     }

     The program displays:
            78.5375
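
In the example above, $radius is visible to the whole program, so a subroutine can accidentally overwrite a variable used elsewhere.  A common way to avoid this, sketched here with a made-up helper called gcContent, is to declare the subroutine's variables with my, which keeps them private to the enclosing block.  The my keyword and the use strict and use warnings lines are not covered in this tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of G and C bases in a sequence
sub gcContent {
    my ($seq) = @_;                     # copy the first argument out of @_
    if (length($seq) == 0) { return 0; }

    my $gc = 0;
    for (my $i = 0; $i < length($seq); $i++) {
        my $base = substr($seq, $i, 1);
        if ($base eq "G" || $base eq "C") { $gc++; }
    }
    return($gc / length($seq));
}

print gcContent("GGCC"), "\n";          # all four bases are G or C, prints 1
print gcContent("GAATTC"), "\n";        # 2 of the 6 bases are G or C
```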


Example Program using for, while, foreach, and if/elsif/else


The following program creates 10 random DNA sequences each of length 100.  Type it in, save it as randSeq.pl and run it by typing perl randSeq.pl

     # This program will create random DNA sequence data

     $len = 100;
     $numSeq = 10;
     @seqArray = ();         # initially empty
    
     for($I = 1; $I <= $numSeq; $I++) {
        $seq = "";           # initially empty

        $j = 1;
        while($j <= $len) {
           $val = rand(4);   # return a random value
           if($val < 1)      { $seq = $seq . "A"; }
           elsif($val < 2)   { $seq = $seq . "C"; }
           elsif($val < 3)   { $seq .= "G"; }
           else              { $seq .= "T"; }
           $j++;
        }  # End of while $j

        push(@seqArray, $seq);    # add the new sequence
     } # End of for $I
    
     print ("Here are the sequences:\n");

     foreach $currSeq (@seqArray) {
        print("$currSeq\n");
     }

Exercises


A.      Using the helloMe.pl example as a guide, update the randSeq.pl program to allow the user to specify the number of sequences to create and the length of these sequences.  Note: when you want to read in two or more command line options, separate them by a comma.  For instance, if we wanted to read in both the first name and last name from the user, we could have written:

&GetOptions("firstName=s" => \$first, "lastName=s" => \$last);

Then the variable $first holds the first name and $last holds the last name.

B.      Here are some useful functions:

length($str) returns the length of the string in $str
substr($str, begin, len) gets a substring from $str beginning at offset begin with length len.

Using these functions, we can retrieve the reverse complement of a sequence using the following code:

$seq = "AGCTAATT";
$seqRC = "";     # initially empty
$strLen = length($seq);

# The for loop will start at the end of the string
# and decrement from there

# REMINDER: Perl string offsets, like array indexes, start at 0, not 1!

for($I = ($strLen-1); $I >= 0; $I--) {
     $tmpStr = substr($seq, $I, 1);

     if($tmpStr eq "A")      { $seqRC .= "T"; }
     elsif($tmpStr eq "C")   { $seqRC .= "G"; }
     elsif($tmpStr eq "G")   { $seqRC .= "C"; }
     else                    { $seqRC .= "A"; }
}

print($seqRC);

      This will print out
            AATTAGCT

      Update the code from exercise A by including the above code as a subroutine.  Then
      create a second array which will hold the reverse complements of these sequences.

Basic I/O in Perl

Command Line options


We have previously used examples that read in command line options.  Remember that two things are necessary when reading in command line options:

1)   use Getopt::Long;   must be located near the top of the program to tell the Perl interpreter to use the package that defines the GetOptions function.

2)   &GetOptions(); must be called to retrieve the options from the command line.  For instance, we can call it with

&GetOptions("firstName=s" => \$first, "lastName=s" => \$last);

Notice that each command line option is given as a pair: the option that the user enters and a reference to the variable in which the value is stored.  If we had updated helloMe.pl to include first and last names, it could be called as follows:

perl helloMe.pl -firstName Eric -lastName Rouchka

So now $first = Eric and $last = Rouchka.

Command Line arguments


    Perl programs can be written that take command-line arguments.  When such a program is run, the command-line arguments go into a special array @ARGV (argument vector).

    Example Suppose you have a program called test.pl :

     $argc = @ARGV;          # get the number of arguments
     print "There are $argc args on the command line.\n";
     print join(' ', @ARGV) . "\n";

    When the program is run by using
           
            perl test.pl a b c

    The program will display:

            There are 3 args on the command line.
     a b c

Files


    When you are dealing with file operations such as opening a file, reading from a file or writing to a file, Perl uses a name (not necessarily related to the real name of the file) to represent that file.  This name is called the filehandle.  More accurately, a filehandle is the name for an I/O connection between your Perl process and the outside world.


Opening a file

    If you want to open a file called "temp.txt" and associate it with a filehandle called FILE1, you would do the following:

            open(FILE1, "temp.txt");

    Files can be opened for reading, writing, or appending.  Opening an existing file for writing erases its previous contents.

     open(FILE1, "<temp.txt");  # open temp.txt for reading
     open(FILE1, ">temp.txt");  # open temp.txt for writing
     open(FILE1, ">>temp.txt"); # open temp.txt for appending
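
open() can fail, for example when the file does not exist or cannot be written, and the examples in this section do not check for that.  A minimal sketch of one common way to check, assuming a made-up file name: open() returns false on failure, so "or die" stops the program with a message, and the special variable $! holds the system's description of the error.

```perl
use strict;
use warnings;

my $file = "temp_example.txt";          # hypothetical file name

# Write one line to the file, stopping if it cannot be opened
open(OUT, ">$file") or die "cannot open $file for writing: $!\n";
print OUT "first line\n";
close(OUT);

# Read the line back, again checking that open() succeeded
open(IN, "<$file") or die "cannot open $file for reading: $!\n";
my $line = <IN>;
close(IN);

print $line;                            # prints: first line

unlink($file);                          # remove the example file again
```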


Reading from a file

    To read a line from a file opened for reading, enclose the filehandle associated with the file in angle brackets ( < and > ).

            $line = <FILE1>;     # Reads a single line from the file
                     # specified by FILE1 and stores it in $line

    Examples

            open (FILE1, "test.txt");
    
     while($line = <FILE1>) {
          print $line;
     }
    
     close(FILE1);







    There's another way to do this.  If you don't assign the line you read to anything, it will be assigned to a special scalar variable $_.

           
     while(<FILE1>) {
          print $_;
     }
     …

    Actually you can even leave out $_ in the print statement.

            while(<FILE1>) {
          print;
     }


Writing to a file

To write to a file, you open it for writing or appending and then specify the corresponding filehandle with the print function.

The following program copies the content of a file called "test.txt" into a file called "test2.txt".

            open(FILE1, "test.txt");
     open(FILE2, ">test2.txt");
    
     while(<FILE1>) {
          print FILE2 $_;
     }

     close(FILE1);

     close(FILE2);

Example program writing to files


      The following program creates a random sequence using a length specified by the user and stores it into the file seq.fa.  Type in this program and save it as createSeq.pl and then run it by typing:

            perl createSeq.pl -length [LENGTH]


# This program will create random DNA sequence data
# and write it to a file

# USAGE: perl createSeq.pl -length [LENGTH]

use Getopt::Long;

&GetOptions("length=s" => \$len);


$seq = "";
for($j = 1; $j <= $len; $j++) {
   $val = rand(4);                 # returns a random value
   if($val < 1)      { $seq .= "A"; }
   elsif($val < 2)   { $seq .= "C"; }
   elsif($val < 3)   { $seq .= "G"; }
   else              { $seq .= "T"; }
}


open(OUTFILE, ">seq.fa");                     # open seq.fa for writing

print OUTFILE ">Temporary FASTA Sequence ";   # Add in a sequence
print OUTFILE ("of length $len\n");           # descriptor

$numLines = int($len / 50);                   # we want to put 50
if(($len % 50) != 0) { $numLines++; }         # characters per line

for($i = 0; $i < $numLines; $i++) {              #print out the sequence
   print OUTFILE substr($seq, $i * 50, 50), "\n"; # one line at a time
}

close(OUTFILE);                            # close seq.fa

Example Program of Writing to Files and Subroutines


The following program will read in a FASTA file, create its reverse complement, and save the reverse complement to a new file.  Type it in and save it as reverseFasta.pl.  If you have run the above program, you should have created a FASTA file seq.fa.  Now run this program by typing:

perl reverseFasta.pl -fn seq.fa

This will create a new file seq.fa.RC that will contain the reverse complement of the sequence.



# This program will read in a fastA file and create a new
# fasta file that is its reverse complement.
# USAGE: perl reverseFasta.pl -fn [FILENAME]

use Getopt::Long;

&GetOptions("fn=s" => \$fileName);

sub reverseComplement {     # SUBROUTINE DEFINITION TO
   $tmpSeq = $_[0];         # CREATE THE REVERSE COMPLEMENT
                            # OF A SEQUENCE
   $seqRC = "";
   $strLen = length($tmpSeq);

   for($I = ($strLen-1); $I >= 0; $I--) {
       $tmpStr = substr($tmpSeq, $I, 1);
       if($tmpStr eq "A")    { $seqRC .= "T"; }
       elsif($tmpStr eq "C") { $seqRC .= "G"; }
       elsif($tmpStr eq "G") { $seqRC .= "C"; }
       else                  { $seqRC .= "A"; }
   }
   return($seqRC);
}

$seq = "";
open(INFILE, "$fileName");          # open a file for reading

$description = <INFILE>;            # retrieve the fasta description
chomp($description);                # remove the "\n"

while($line = <INFILE>) {
   chomp($line);                    # remove the "\n"
   $seq .= $line;
}

close(INFILE);

$newSeq = reverseComplement($seq);  # CALL THE SUBROUTINE

$newDesc = $description . " -- REVERSE COMPLEMENT ";

$newFileName = ">" . $fileName . ".RC";
open(OUTFILE, $newFileName);        # open a file for writing
print OUTFILE ("$newDesc\n");       # print the seq descriptor

$len = length($newSeq);             # This time, we will write
$numLines = int($len / 60);         # 60 characters per line
if(($len % 60) != 0) { $numLines++; }

for($i = 0; $i < $numLines; $i++) {
   print OUTFILE substr($newSeq, $i * 60, 60), "\n";
}
close(OUTFILE);                     # close the new data file

Regular Expressions

    Regular expressions are what make Perl so attractive to the computational biology community.  Regular expressions can be used to find patterns in strings, such as looking for specific promoters or start codons in a DNA sequence.

    A regular expression is a pattern or template to be matched against a string.

    To relate a regular expression to a string, the pattern binding operator(=~) is used.

Matching (m//)


    The matching operator (m// or just //) is used to find patterns in a string.
    For example, we want to test if a string contains the sequence ATG:

     $dnaStr = 'TTCGATGCCAC';

     if($dnaStr =~ /ATG/) {
          print ("ATG found.\n");
     }
     else {
          print ("ATG not found.\n");
     }

     This program displays:

            ATG found.

    The case of the characters in a string can be ignored by appending an i to the matching operator.

            /bbb/                # will not match "AAA BBB CCC"
            /bbb/i               # will match "AAA BBB CCC"


Substitutions (s///)


    The substitution operator (s///) is used to change strings.  To use it, put the old string between the first and second / and the new string you want to change it to between the second and third /.




    For example, we want to change ATG to TGA:

     $str = 'TTCGATGCCAC';
     $str =~ s/ATG/TGA/;
     print "$str\n";

     This program displays:
            TTCGTGACCAC

    The program above only changes the first occurrence of ATG to TGA.  If you think there might be more than one ATG and want to change all of them, you can append a g to the substitution operator.

            $str =~ s/ATG/TGA/g;

Translations (tr///)



The translation operator (tr///) is used to change individual characters.  (Note: substitution is for changing strings.)  To use it, put the old character(s) between the first and second / and the new character(s) between the second and third /.  The translation operator translates all occurrences of each old character to the corresponding new character.

For example, if we want to change all of the T's in a string to U:

     $str = 'TTCGATGCCAC';
     $str =~ tr/T/U/;
     print "$str\n";

     This program displays:
            UUCGAUGCCAC

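You can translate multiple characters at a time.  If only one replacement character is given, it is used for all of them.

     $str =~ tr/AG/P/;    # all A's and G's in $str are P now

If more than one replacement character is given, each old character is translated to the new character in the same position.

     $str =~ tr/AG/PL/;   # all A's in $str are P and all G's are L now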
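
Because tr/// maps characters position by position, it offers a compact alternative to the if/elsif loop used earlier for the reverse complement.  This is a sketch of that alternative, not the method used in the programs above; it relies on reverse() reversing the characters of a string when its result is assigned to a scalar.

```perl
use strict;
use warnings;

my $seq   = "AGCTAATT";
my $seqRC = reverse($seq);      # scalar context: "TTAATCGA"
$seqRC =~ tr/ACGT/TGCA/;        # A->T, C->G, G->C, T->A

print "$seqRC\n";               # prints: AATTAGCT
```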





How to Create More Complicated Patterns


    There are some other more complicated patterns that we may want to find.  For instance, in computational biology, we may be interested in locating the coding sequence in an mRNA.  In such a case, we will be looking for a pattern that begins with ATG (the start codon) and ends with one of three stop codons (TAA, TAG, or TGA).  We may also want to separate the alphabetic from the numeric characters in a string.

Single-Character Patterns:

    The dot "." will match any character except the newline (\n);

    Example: /a.b/ matches "aab", "abb", "acb", …

Character class

      A character class is a list of characters between a pair of square brackets ([]).  It matches any single character from the list.

      Example: /[abcd]/ matches a, b, c or d, but not e.

[0123456789]  # match any single digit
[0-9]         # match any single digit
[a-z]         # match any single lowercase letter
[a-zA-Z0-9]   # match any single letter or digit

Negated Character Class

      A negated character class is similar to a character class, but has a leading "^" immediately after the left bracket.  This character class matches any single character that is not in the list.

      Example:

     [^0-9]        # match any single non-digit
     [^aeiouAEIOU] # match any single character that is not a vowel

Predefined Character Classes

            \d   a digit                 same as [0-9]
     \D   non-digit               same as [^0-9]
     \w   word character          same as [a-zA-Z0-9_]
     \W   non-word character      same as [^a-zA-Z0-9_]
     \s   white space character   same as [ \r\t\n\f]
     \S   non space character     same as [^ \r\t\n\f]

Special Multipliers

      Special multipliers indicate how many times the character to their left should be matched.

            *          0 or more times
            +          1 or more times
            ?          0 or 1 time

      Example:

     /A+CGC?A/ # match one or more A's followed by CG,
               # followed by an optional C followed by an A

General Multipliers

      General multipliers indicate how many times the character to its left should be matched.

      Examples

/A{3}/        # match exactly 3 A's
/A{3,}/       # match 3 or more A's
/A{3,8}/      # match 3 to 8 A's

      The transcription factor binding site for the SSP protein is:

GGCGGCGGCTGGCTAGGG

      Using general multipliers, we can create a regular expression for this pattern as follows.  A multiplier placed after a group in parentheses applies to the whole group:

/(GGC){3}TG{2}CTAG{3}/
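
One way to check a multiplier pattern is to test it against the site itself and against a near miss.  In the sketch below the pattern is written as (GGC){3}TG{2}CTAG{3}, and the anchors ^ and $ (explained later in this section) force the whole string to match.  The second test string is made up for illustration.

```perl
use strict;
use warnings;

my $site     = "GGCGGCGGCTGGCTAGGG";    # the SSP binding site
my $nearMiss = "GGCGGCTGGCTAGGG";       # made-up variant with only two GGC repeats

my $siteMatches = 0;
my $missMatches = 0;

if ($site     =~ /^(GGC){3}TG{2}CTAG{3}$/) { $siteMatches = 1; }
if ($nearMiss =~ /^(GGC){3}TG{2}CTAG{3}$/) { $missMatches = 1; }

print "site: $siteMatches, near miss: $missMatches\n";   # site: 1, near miss: 0
```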

Alternation

      Alternation allows you to match one out of several different alternatives.  The alternatives are separated by a vertical bar (|).

      Examples:

            /song|blue/   # match either 'song' or 'blue'
     /a|b|c/       # match a, b or c. (same as /[abc]/)




      The GATA-1 transcription factor binding site is defined by a T or an A, followed by GATA, followed by an A or a G.  Using alternation, we can create a regular expression for this as follows:

            /(T|A)GATA(A|G)/
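
Combining alternation with multipliers gives a rough sketch of the coding-sequence search described at the start of this section.  It is only an illustration on a made-up sequence, not a real gene finder.  It uses the non-greedy multiplier *?, which is not covered in this tutorial and matches as few repetitions as possible, and it stores the match in $1 using parentheses, which are explained under Pattern Memory.

```perl
use strict;
use warnings;

my $mrna = "GGATGCCCTTTTAAGG";      # made-up sequence for illustration
my $cds  = "";

# ATG, then as few whole codons as possible, then the first
# in-frame stop codon (TAA, TAG or TGA)
if ($mrna =~ /(ATG([ACGT]{3})*?(TAA|TAG|TGA))/) {
    $cds = $1;
}

print "coding sequence: $cds\n";    # prints: coding sequence: ATGCCCTTTTAA
```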

Anchoring patterns

      ^ matches the beginning of a string, while $ matches the end of a string.

      Examples

            /^this/   # matches 'this one' but not 'watch this'
     /this$/   # matches 'watch this' but not 'this one'

Pattern Memory


      So now that you know how to match characters, you need a way to find out what was matched by storing or saving the matching portions.

      Putting a pair of parentheses around any part of a pattern causes the part of the string matched by it to be remembered and stored in a special variable called $1.  If there are several pairs of parentheses, their matches are stored in $2, $3, and so on, numbered by the position of the opening parenthesis.

      Example: 

      The following program looks for the GATA-1 binding site, stores it in $1, and prints it out.

$seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
$seq =~ /((T|A)GATA(A|G))/;
print "$1\n";

This program will print TGATAA

      Perl also has a few special variables to help you know what matched and what did not. 

$&       the part of the string that actually matched the pattern.
$`         everything before the match
$'         everything after the match




      Example:

$seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
$seq =~ /(T|A)GATA(A|G)/;
print "$`\n";
print "$&\n";
print "$'\n";

This program will display:

AAAGAGAGGGATAGAATAGAGA
TGATAA
GAAA
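
The numbered variables are only meaningful after a successful match, so it is safer to test the match in an if statement before reading them.  A short sketch using the same sequence, with all three pairs of parentheses captured (the variable names are made up; use strict, use warnings, and my are additions):

```perl
use strict;
use warnings;

my $seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
my ($site, $first, $last) = ("", "", "");

if ($seq =~ /((T|A)GATA(A|G))/) {
    # $1 is the outer group, $2 and $3 the inner groups,
    # numbered by the position of their opening parenthesis
    ($site, $first, $last) = ($1, $2, $3);
}

print "site=$site first=$first last=$last\n";   # site=TGATAA first=T last=A
```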



