Perl Basics
What is Perl?
●
Perl stands for
Practical Extraction and Report Language.
●
Perl is an
interpreted programming language optimized for scanning arbitrary text files,
extracting information from those text files, and printing reports based on
that information.
●
The string-based
nature of DNA and protein sequence data makes Perl an obvious choice for many
of the simpler problems in computational biology.
How do I install Perl on my Computer?
Perl is an application, just
like any other Windows application. It
can be downloaded for free at http://www.perl.com/
This is the main website for Perl, and it contains many useful links.
Downloading Perl For Windows
If you are using a
Windows-based computer, you should go to the ActiveState website (http://www.activestate.com/)
and download the latest version of ActivePerl, which is freely available.
As of this writing, the latest version is ActivePerl 5.8.0 build 804.
Make sure to download the Windows MSI package. There are
installation instructions on the ActiveState website as to how to install
ActivePerl. You should just be able to download the appropriate package
and run it, and the installation should automatically occur.
Perl for Other Operating Systems
If you need the Mac version
of Perl, you can go to http://www.macperl.com/
in order to download Perl.
If you are using a Linux or
Unix based operating system, then chances are great that Perl is already
installed. To see if Perl is installed
in Linux or Unix, type:
perl -v
If it is installed, you will
get back a message giving you information concerning the version you are using;
otherwise, it will report back:
command not found
ActiveState Download Page.
Adding Perl to your path
Installing ActivePerl
should automatically add Perl to your path.
You can check this by starting a command window (go to start->Run and
type in “command”) and typing in the word perl.
If you get a message saying “command not found” then Perl is not in your
path. Otherwise, Perl might start and be
awaiting input from you (you can type ctrl-c to exit).
If Perl is not in your path,
you can add it by editing the file c:\autoexec.bat. Adding Perl to your path will allow you to
run Perl by typing only the command perl and not the full path, such as c:\perl\bin\perl. To do this,
you need to do two things. First, make a
backup copy of the file c:\autoexec.bat. This can be
done in the DOS window by typing
copy c:\autoexec.bat c:\autoexec.bak
Now we need to edit the file c:\autoexec.bat. This can
either be done in Notepad, which can be found
under Start->Programs->Accessories,
or by typing in the DOS window:
edit c:\autoexec.bat
Now, at the end of this file, add the following line, substituting the
location where Perl is installed:
PATH=%PATH%;PATH TO PERL GOES HERE;
If you installed ActivePerl using the default locations, then Perl is located in
c:\Perl\Bin, so the line becomes:
PATH=%PATH%;c:\perl\bin;
Then save autoexec.bat. You may need to reboot your computer for these
changes to take effect. Note that you only need to do this step once.
Creating Perl Programs
Perl programs can be created
using any basic text editor, such as notepad.
Perl programs typically end with a “.pl” extension. If you are saving Perl
programs in Notepad, make sure the file type in the save dialog is set to all file types. Otherwise your Perl programs will have a .txt
extension automatically added to them!
What is a Perl Program?
■
A program
consists of a text file containing a series of Perl statements.
■
Perl statements
are separated by a semi-colon.
■
Leading spaces on
a line are ignored.
■
Multiple spaces,
tabs, and blank lines are ignored.
■
Anything
following # is ignored.
■
Perl is case
sensitive.
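The rules above can be seen together in a short sketch. The variable names here are made up for illustration, and variables themselves are explained in a later section.

```perl
# Anything after a # is a comment and is ignored by Perl.
$count = 3; $Count = 10;   # two statements on one line; $count and $Count are different variables
$total =
        $count
    +   $Count;            # one statement spread over several lines with extra spaces
print "total is $total\n"; # prints: total is 13
```

Note that spaces are only ignored between the parts of a statement. Spaces inside a quoted string are kept exactly as typed.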
Your First Perl Program
We will be starting with a
simple program to get you started with learning what a Perl program looks like,
and how it is created and run.
How do I edit or create a Perl
Program?
For now, you can use either
notepad, edit, or any other simple text editor to create your perl
program. The editor you choose to use
must save your files as plain text. Perl
files should always be saved with a .pl extension at the end.
How do I run a Perl Program?
Perl programs are run by
typing perl
followed by the program name and any command line options. You should be running the perl programs from
the command window, which is accessed by Start->run and typing
“command”. You should then change the
directory (using the command “cd” ) to the directory where the perl program is
located. For instance, if you save the
program “hello.pl” in the directory c:\PerlPrograms\ you would type: cd
c:\PerlPrograms
You can then run “hello.pl” by typing perl hello.pl
Hello world
Please type in the following
program using the editor as described above.
# This is my first Perl program!
print ("Hello World!\n");
print ("This is my first Perl program.\n\n");
Now that you have it typed
in, we need to save and run the program.
Save the program as hello.pl and then run
it by typing perl hello.pl
Understanding Hello World
print() is a function that displays text to the screen.
\n
prints a new line to the screen; i.e. it goes to the start of the next line.
# signals
the beginning of the comment – everything after it will be ignored
; separates the Perl
statements from one another. Statements
do not have to occur on separate lines.
This is done to make it easier for humans to read.
Updating Hello World
When you run a program, it is
often desired to use different data. We
will now update the hello world program to use command line arguments that can
be used to set different options for a program.
In this specific example, we will allow the user to enter in their name,
which then gets printed to the screen.
Edit the hello.pl file so it now looks like the following:
use Getopt::Long;
# usage: perl helloMe.pl -name [NAME]
&GetOptions("name=s" => \$Name);
# The above line lets the user enter their name
print ("Hello $Name \n");
print ("I now have written a perl program using command \n");
print ("line arguments and variables \n\n");
Once you have finished, save this file as helloMe.pl and then run it by typing:
perl helloMe.pl -name yourname
Understanding the Updated Version
use Getopt::Long; For now, we
will just say that this line is used whenever we want to retrieve command line
arguments. Basically this line tells the
perl interpreter that we are using a function that is already defined in
another location.
&GetOptions() is a function that retrieves the command line
options. In this example, it retrieves
the value directly after the -name when the program is run. It stores the value that was entered in a variable
called $Name. This value can be retrieved any
time in the program. We will talk about
variables more in the next perl session.
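As a minimal sketch of how the option ends up in the variable, the following sets @ARGV by hand to simulate typing "perl helloMe.pl -name Alice". The default value 'stranger' and the usage message are assumptions added for illustration and are not part of helloMe.pl.

```perl
use Getopt::Long;

# Simulate the command line "perl helloMe.pl -name Alice" for this sketch.
@ARGV = ('-name', 'Alice');

$Name = 'stranger';                  # used only if -name is not given
# GetOptions returns false if the options could not be parsed
GetOptions("name=s" => \$Name) or die "usage: perl helloMe.pl -name NAME\n";
print "Hello $Name\n";               # prints: Hello Alice
```

Giving the variable a value before calling GetOptions is a simple way to supply a default, because GetOptions only overwrites it when the option is present.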
Perl Data Types
Variables
●
A variable is a
container that holds one or more values that can change throughout a program.
●
There are 3 types
of variables in perl:
●
Scalar
●
Array
●
Associative array
(hash/lookup table)
Scalar
●
A scalar variable
holds a single value, which can be a number or a character string.
●
Scalar variables
have a dollar sign ($) prefix.
●
Examples of
scalar variables:
$EcoRI = 'GAATTC';
$len = 6;
Strings
Strings are sequences of
characters (like hello).
Single-Quoted Strings
●
Text placed
between a pair of single quotes is interpreted literally.
●
To get a single
quote into a single-quoted string, precede it by a backslash (\).
●
To get a
backslash into a single-quoted string, precede the backslash by a backslash.
●
Examples of
strings:
'hello' # hello
'can\'t' # can't
'http:\\\\www' # http:\\www
Double-Quoted Strings
●
The double quote
“interpolates” variables between the
pair of quotes, which means that the variable names within the string are
replaced by their current values.
●
Examples
$x = 1;
print '$x'; # will print out $x
print "$x"; # will print out 1
●
There are several
different escape characters that can be printed out:
\n Newline
\t Tab
\\ Backslash
\" Double quote
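The difference between the two kinds of quotes can be sketched by building the same text both ways. The variable names below are chosen only for illustration.

```perl
$enzyme = 'EcoRI';
$site   = 'GAATTC';
$single = 'Enzyme: $enzyme\t$site';   # single quotes: kept exactly as typed
$double = "Enzyme: $enzyme\t$site";   # double quotes: variables and \t are replaced
print "$single\n";                    # prints: Enzyme: $enzyme\t$site
print "$double\n";                    # prints: Enzyme: EcoRI, a tab, then GAATTC
```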
Operators on Scalar Variables
●
An operator
generates a new value from one or more values.
●
Arithmetic
Operators
+ Addition
- Subtraction
* Multiplication
** Exponentiation
/ Division
% Modulus
●
Assignment
operator (=)
$a = 5;
$b = "Hello";
$c = $a + 2;
●
Binary
Assignment Operators (+= , -=)
$a += 2; # Equivalent to $a = $a + 2;
$a
-= 2; # Equivalent to $a = $a - 2;
●
Autoincrement
and Autodecrement Operators (++, --)
$a++; # Equivalent to $a = $a + 1;
$a--; # Equivalent to $a = $a - 1;
●
String
Concatenation Operator (.)
$exon1 = "CGGCTGCCACGAGTGCAGGCCAG";
$exon2 = "GCAACGGGATGGTGAGCCGCT";
$mRNA = $exon1 . $exon2;
# $mRNA is CGGCTGCCACGAGTGCAGGCCAGGCAACGGGATGGTGAGCCGCT
●
Numeric and
String Comparison Operators
Comparison Numeric String
Equal == eq
Not Equal != ne
Less Than < lt
Greater Than > gt
Less than or equal to <= le
Greater than or equal to >= ge
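Choosing the wrong column of this table is a common source of bugs, because the same two values can compare differently as numbers and as strings. A small sketch follows; the if/else statement it uses is explained in the section on control structures.

```perl
$x = "10";
$y = "9";
# Numeric comparison: 10 is greater than 9
if ($x > $y)  { $numeric = "10 is greater than 9"; }
else          { $numeric = "10 is not greater than 9"; }
# String comparison works character by character: "1" comes before "9"
if ($x gt $y) { $text = "10 sorts after 9"; }
else          { $text = "10 sorts before 9"; }
print "Numeric comparison: $numeric\n";   # prints: Numeric comparison: 10 is greater than 9
print "String comparison: $text\n";       # prints: String comparison: 10 sorts before 9
```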
●
chop() and chomp() string operators
●
The chop() operator removes the last character from the
string.
$x = "world";
chop($x); # $x is now "worl"
●
The chomp() operator removes the newline character from the end
of the string, if there is one.
$a = "Hello World\n";
chomp($a); # $a is now "Hello World"
Arrays
●
An array is an
ordered list of data.
●
An array variable
name begins with an at sign (@).
Array Assignment
@array1 = (1, 2, 3, 4, 'five'); # array1 has 5 elements
@array2 = @array1; # array2 is the same as array1
@array3 = (@array1, 'six'); # array3 is (1, 2, 3, 4, 'five', 'six')
Element Access
An array element can be
accessed by a numeric index.
Array elements are numbered
using sequential integers, beginning at zero.
@array1 = (1, 2, 3);
$a = $array1[2]; # $a is 3 (a single element is written with $, not @)
$array1[1]
= 6; # @array1 is (1, 6, 3)
$array1[0]
++; # @array1 is (2, 6, 3)
push() and pop() operators
push() will insert an element at the end of the array. pop() will remove and return the last element from the
array.
@array1 = (1, 3, 5);
push(@array1, 2, 4, 6); # @array1 is (1, 3, 5, 2, 4, 6)
$last = pop(@array1); # @array1 is (1, 3, 5, 2, 4)
# and $last is 6
shift() and unshift() operators
shift() removes and returns the first element of the
array. unshift() inserts an element to the beginning of the array.
unshift(@array1, 7, 8); # @array1 is (7, 8, 1, 3, 5, 2, 4)
$first = shift(@array1); # @array1 is (8, 1, 3, 5, 2, 4)
# and $first is 7
reverse() reverses the order of the elements of the array.
@array1 = (1, 2, 3);
@array2 = reverse(@array1); # @array2 is (3, 2, 1);
# @array1 is still (1, 2, 3)
join() glues the elements of a list together with a glue string between each
element.
@array1 = ('Hello', 'to', 'me.');
$sentence = join(' ', @array1); # $sentence is 'Hello to me.'
Retrieving array length
Using an array in a context
where a scalar value is expected returns the length of the array.
@array1 = (1, 2, 3, 4);
$a
= @array1; #$a is 4
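A short sketch of two related ways to ask about an array's size. The scalar() function and the $#array notation are standard Perl but are not covered in the text above; the array name is made up.

```perl
@bases = ('A', 'C', 'G', 'T');
$count     = @bases;          # array used where a scalar is expected: 4
$count2    = scalar(@bases);  # the same thing, stated explicitly: 4
$lastIndex = $#bases;         # index of the last element: 3
print "$count elements; the last one is $bases[$lastIndex]\n";
# prints: 4 elements; the last one is T
```

Because indexes start at zero, the last index is always one less than the length.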
Associative Arrays
■
Also called
hashes or look up tables.
■
The associative
array variable begins with a percent sign (%).
■
An associative
array is similar to an array. The
difference is that an array uses integers as index values but the associative
array uses arbitrary scalars called keys.
The keys are used to retrieve the corresponding values from the
associative array.
%person
= ('name', 'John Doe',
'age', 50,
'salary', 100000);
■
Associative
arrays are not ordered. Whenever we want
to find some specific values, the keys are used to find them. The curly brackets {} are used instead of the
square brackets [] for associative arrays.
$a = $person{'age'}; # $a is 50
$person{'city'} = 'Chicago'; # key 'city' and value
# 'Chicago' are added
Operators for Associative Arrays
■
keys() returns a list of all the current keys in the
associative array.
@list = keys(%person); # @list gets ('name', 'age',
# 'salary', 'city'), but not
# necessarily in that order
■
values() returns a list of all the current values in the
associative array.
@list1 = values(%person); # @list1 gets ('John Doe', 50,
# 100000, 'Chicago'), but not
# necessarily in that order.
■
each() returns a key-value pair as a two element list.
while(($a, $b) = each(%person)) {
print
"$a \t $b \n";
}
Something like the
following will be printed (the order of the pairs is not guaranteed):
name John Doe
age 50
salary 100000
city Chicago
delete removes both key and value from the associative array.
delete $person{'city'};
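Since associative arrays are not ordered, one common approach is to sort the keys before looping over them. The sketch below uses a made-up codon table; sort() and exists() are standard Perl functions not covered above, and the foreach loop is explained in the section on control structures.

```perl
%codon = ('ATG', 'Met', 'TGG', 'Trp', 'TAA', 'Stop');

$report = "";
foreach $key (sort(keys(%codon))) {      # sort gives a predictable order
    $report .= "$key=$codon{$key} ";
}
print "$report\n";                       # prints: ATG=Met TAA=Stop TGG=Trp

# exists() tests for a key without adding it to the associative array
if (exists($codon{'TGA'})) { print "TGA is a key\n"; }
else                       { print "TGA is not a key\n"; }   # this branch runs

delete $codon{'TAA'};
$remaining = keys(%codon);               # keys() where a scalar is expected gives the count
print "$remaining keys remain\n";        # prints: 2 keys remain
```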
Perl Control
Structures
Statement Blocks
■
A sequence of
statements can be grouped together by enclosing them in curly braces to form a
statement block. Each statement in the
block will be executed in the order in which they appear.
■
Structure
{
first statement;
second statement;
…
last statement;
}
if/elsif/else statement
■
Structure
if(expression) {
statement 1;
statement 2;
…
}
else {
statement a;
statement b;
}
■
If expression
evaluates to true, the first statement block will be executed and the second
will be skipped. If expression evaluates
to false, only the second block will be executed. The else and its associated statement block
are optional.
■
An expression is false if it is
undefined, the number zero, the string "0", or the empty string. Anything else is true.
■
If there are more
than two possible choices, one or more elsif branches can be added.
■
Structure
if(expression
1) {
…
}
elsif(expression
2) {
…
}
elsif(expression
3) {
…
}
else
{
…
}
■
Example
$day = 1;
if($day
== 1) {
print
("Sunday\n");
}
elsif($day
== 2) {
print
("Monday\n");
}
elsif($day
== 3) {
print
("Tuesday\n");
}
else
{
print("Day
must be Sunday, Monday, or Tuesday\n");
}
#
Sunday will be displayed
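Which values count as true is worth testing directly. In Perl, the undefined value, the number 0, the string "0", and the empty string are false, and everything else is true. The sketch below runs a handful of sample values through an if statement; it uses the foreach loop and push(), and the list of values is chosen only for illustration.

```perl
@trueValues = ();
foreach $value (0, 1, '', '0', '0.0', 'ATG') {
    if ($value) {
        push(@trueValues, $value);
        print "($value) is true\n";
    }
    else {
        print "($value) is false\n";
    }
}
# 0, the empty string, and '0' are false; 1, '0.0', and 'ATG' are true
```

The string '0.0' is true even though it is numerically zero, because only the exact string "0" is treated as false.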
while loop
■
Structure
while (expression) {
statement
1;
statement
2;
…
}
■
The statement
block (loop) iterates as long as the expression evaluates to true. When the expression becomes false, program
flow passes to the line of code immediately following the loop.
■
Example
$x = 5;
$y
= 3;
while($x
> $y) {
print("inside
while: $x * $y = ", $x * $y, "\n");
$y++;
}
print("outside
while: x = $x; y = $y\n");
This will display:
inside while: 5 * 3 = 15
inside
while: 5 * 4 = 20
outside
while: x = 5; y = 5
for loop
■
Structure:
for (initial_exp; test_exp;
increment_exp) {
statement
1;
statement
2;
…
}
■
Example
for($x = 0; $x < 5; $x++) {
print("$x * $x = ", $x*$x, "\n");
}
This will display:
0 * 0 = 0
1 * 1 = 1
2 * 2 = 4
3 * 3 = 9
4 * 4 = 16
foreach statement
■
Structure
foreach $s (@list) {
statement
1;
statement
2;
}
■
foreach takes a list of values and assigns them one at a time
to a scalar variable and executes a statement block.
■
Example:
@array = (1, 5, 7, 9, 12, 20, 3, 6, 11);
print
("@array\n");
foreach
$a (@array) {
$a
+= 2;
}
print("@array\n");
The program will display:
1 5 7 9 12 20 3 6 11
3
7 9 11 14 22 5 8 13
last statement
■
The last
statement causes the loop to terminate.
■
Example:
@array
= ("A".."Z");
for($index
= 0; $index < @array; $index++) {
if($array[$index]
eq "G") {
last;
}
}
print("$array[$index]\n");
The program displays:
G
next statement
■
The next
statement causes execution to skip to the next iteration of the loop.
■
Example:
@array
= (0..9);
for($index
= 0; $index < @array; $index++) {
if(($array[$index]
% 3) != 0) {
next;
}
print
("$array[$index] is divisible by 3\n");
}
This program will display
0 is divisible by 3
3
is divisible by 3
6
is divisible by 3
9
is divisible by 3
subroutines
■
If you have code
that will be used in several places, a subroutine can be written in order to
allow the code to only be written once.
Subroutine definitions can be placed anywhere in the program.
sub subroutine_name {
…
}
■
Arguments: Values called
arguments can be passed to a subroutine.
If the subroutine invocation is followed by a list inside parentheses,
the list is automatically assigned to a special variable @_.
■
Return Values: The return value of a subroutine is the value of the
return statement or of the last expression evaluated in the subroutine.
■
Example:
$area = areaOfCircle(5);
print
("$area\n");
sub
areaOfCircle{
$radius
= $_[0];
return(3.1415
* ($radius ** 2));
}
The program displays:
78.5375
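When a subroutine takes more than one argument, a common idiom is to copy @_ into named variables declared with my, which keeps them private to the subroutine. The my keyword is standard Perl but is not covered in this tutorial, and the subroutine below is a made-up example.

```perl
sub areaOfRectangle {
    my ($width, $height) = @_;   # copy the two arguments into private variables
    return $width * $height;
}

$width = 99;                     # a variable with the same name outside the subroutine
$area  = areaOfRectangle(3, 4);
print "$area $width\n";          # prints: 12 99
```

Without my, the assignment inside the subroutine would have overwritten the outer $width with 3.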
Example Program using for, while, foreach, and if/elsif/else
The following program creates
10 random DNA sequences each of length 100.
Type it in, save it as randSeq.pl and run it by typing perl
randSeq.pl
# This program will create random DNA sequence data
$len = 100;
$numSeq = 10;
@seqArray = (); # initially empty
for($I = 1; $I <= $numSeq; $I++) {
  $seq = ""; # initially empty
  $j = 1;
  while($j <= $len) {
    $val = rand(4); # return a random value from 0 up to (but not including) 4
    if($val < 1) { $seq = $seq . "A"; }
    elsif($val < 2) { $seq = $seq . "C"; }
    elsif($val < 3) { $seq .= "G"; }
    else { $seq .= "T"; }
    $j++;
  } # End of while $j
  push(@seqArray, $seq); # add the new sequence
} # End of for $I
print ("Here are the sequences:\n");
foreach $currSeq (@seqArray) {
  print("$currSeq\n");
}
Exercises
A. Using the helloMe.pl example as a guide, update the randSeq.pl program to allow the user to specify the number of
sequences to create, and the length of these sequences. Note: when you want to
read in two or more command line options, separate them by a comma. For instance, if we wanted to read in both
the first name and last name from the user, we could have written:
&GetOptions("firstName=s"
=> \$first, "lastName=s" => \$last);
Then
the variable $first holds the first name and $last holds the last name.
B. Here are some useful functions:
length($str) returns the length of the string in $str
substr($str,
begin, len) gets a substring from $str beginning at offset begin with length len.
Using
these functions, we can retrieve the reverse complement of a sequence using the
following code:
$seq = "AGCTAATT";
$seqRC = ""; # initially empty
$strLen = length($seq);
# The for loop will start at the end of the string
# and decrement from there
# REMINDER: Perl string offsets start at 0, not 1!
for($I = ($strLen-1); $I >= 0; $I--) {
  $tmpStr = substr($seq, $I, 1);
  if($tmpStr eq "A") { $seqRC .= "T"; }
  elsif($tmpStr eq "C") { $seqRC .= "G"; }
  elsif($tmpStr eq "G") { $seqRC .= "C"; }
  else { $seqRC .= "A"; }
}
print($seqRC);
This will print out
AATTAGCT
Update the code from exercise A by
including the above code as a subroutine.
Then
create a
second array which will hold the reverse complements of these sequences.
Basic I/O in Perl
Command Line options
We have previously used examples
that read in command line options.
Remember that two things are necessary when reading in command
line options:
1) use Getopt::Long; must be located near the top of the program
to tell the perl interpreter to use the package that defines the GetOptions function.
2) &GetOptions(); must be called to retrieve the options from the
command line. For instance, we can call
it with
&GetOptions("firstName=s" => \$first, "lastName=s" => \$last);
Notice that for each command line option there is a pair of values: the option that the user enters and a reference to the
variable into which the result is stored.
If we had updated helloMe.pl to include first and last names, it could be called
as follows:
perl helloMe.pl -firstName Eric -lastName Rouchka
So now $first = Eric and $last = Rouchka.
Command Line arguments
■
Perl programs can
be written that take command-line arguments.
When such a program is run, the command-line arguments go into a special
array @ARGV (argument vector).
■
Example Suppose
you have a program called test.pl :
$argc = @ARGV; # get the number of arguments
print "There are $argc args on the command line.\n";
print join(' ', @ARGV) . "\n";
■
When the program
is run by using
perl test.pl a b c
■
The program will
display:
There are 3 args on the command line.
a b c
Files
■
When you are
dealing with file operations such as opening a file, reading from a file or
writing to a file, Perl uses a name (not necessarily related to the real name
of the file) to represent that file. This
name is called the filehandle.
More accurately, a filehandle is the name for an I/O connection between
your Perl process and the outside world.
Opening a file
■
If you want to
open a file called "temp.txt" and associate it with a filehandle
called FILE1, you would do the following:
open(FILE1, "temp.txt");
■
Files can be
opened for reading, writing or appending
open(FILE1, "<temp.txt"); # open temp.txt for reading
open(FILE1, ">temp.txt"); # open temp.txt for writing
open(FILE1, ">>temp.txt"); # open temp.txt for appending
Reading from a file
■
To read a line
from a file opened for reading, enclose the filehandle associated with the file
in angle brackets ( < and > ).
$line = <FILE1>; # Reads a single line from the file
# specified by FILE1 and stores it in $line
■
Examples
open (FILE1, "test.txt");
while($line
= <FILE1>) {
print
$line;
}
close(FILE1);
■
There's another
way to do this. If you don't assign the
line you read to anything, it will be assigned to a special scalar variable $_.
…
while(<FILE1>)
{
print
$_;
}
…
■
Actually you can
even leave out $_ in the print statement.
while(<FILE1>) {
print;
}
Writing to a file
To write to a file, you open
it for writing or appending and then specify the corresponding filehandle with
the print function.
The following program copies
the content of a file called "test.txt" into a file called "test2.txt".
open(FILE1, "test.txt");
open(FILE2,
">test2.txt");
while(<FILE1>)
{
print
FILE2 $_;
}
close(FILE1);
close(FILE2);
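The copy program above assumes that both files can be opened. A minimal sketch of the same copy with error checking follows; "or die" and the $! variable (which holds the system's error message) are standard Perl but are not covered in this tutorial. The first few lines create a small test.txt so that the sketch can run on its own.

```perl
# Create a small input file so the sketch can run on its own.
open(OUT, ">test.txt") or die "Cannot write test.txt: $!\n";
print OUT "line one\nline two\n";
close(OUT);

# Stop with a message if either file cannot be opened.
open(FILE1, "<test.txt")  or die "Cannot open test.txt: $!\n";
open(FILE2, ">test2.txt") or die "Cannot open test2.txt: $!\n";
$count = 0;
while ($line = <FILE1>) {
    print FILE2 $line;           # copy the line to the output file
    $count++;
}
close(FILE1);
close(FILE2);
print "Copied $count lines\n";   # prints: Copied 2 lines
```

Without the check, a misspelled file name fails silently and the while loop simply never runs.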
Example program writing to files
●
The following
program creates a random sequence using a length specified by the user and
stores it into the file seq.fa. Type in this program and save
it as createSeq.pl and then run it by typing:
perl createSeq.pl -length [LENGTH]
# This program will create random DNA sequence data
# and write it to a file
# USAGE: perl createSeq.pl -length [LENGTH]
use Getopt::Long;
&GetOptions("length=s" => \$len);
$seq = "";
for($j = 1; $j <= $len; $j++) {
$val = rand(4); # returns a random value
if($val < 1) { $seq .= "A"; }
elsif($val < 2) { $seq .= "C"; }
elsif($val < 3) { $seq .= "G"; }
else { $seq .= "T"; }
}
open(OUTFILE, ">seq.fa"); # open seq.fa for writing
print OUTFILE ">Temporary FASTA Sequence "; # Add in a sequence
print OUTFILE ("of length $len\n"); # descriptor
$numLines = int($len / 50); # we want to put 50
if(($len % 50) != 0) { $numLines++; } # characters per line
for($i = 0; $i < $numLines; $i++) { #print out the sequence
print OUTFILE
substr($seq, $i * 50, 50), "\n"; # one line at a time
}
close(OUTFILE); # close seq.fa
Example Program of Writing to Files and Subroutines
The following program will
read in a fasta file, create its reverse complement, and save the reverse
complement to a new file. Type it in and
save it as reverseFasta.pl. If you have
run the above program, you should have created a fasta file seq.fa. Now run this
program by typing in: perl
reverseFasta.pl -fn seq.fa This will create a new file
seq.fa.RC that will have the reverse
complement of the sequence.
# This program will read in a fastA file and create a new
# fasta file that is its reverse complement.
# USAGE: perl reverseFasta.pl -fn [FILENAME]
use Getopt::Long;
&GetOptions("fn=s" => \$fileName);
sub reverseComplement {
# SUBROUTINE DEFINITION TO
$tmpSeq = $_[0]; # CREATE THE REVERSE COMPLEMENT
# OF A SEQUENCE
$seqRC = "";
$strLen =
length($tmpSeq);
for($I = ($strLen-1);
$I >= 0; $I--) {
$tmpStr =
substr($tmpSeq, $I, 1);
if($tmpStr eq
"A") { $seqRC .=
"T"; }
elsif($tmpStr eq
"C") { $seqRC .= "G"; }
elsif($tmpStr eq
"G") { $seqRC .= "C"; }
else { $seqRC .= "A"; }
}
return($seqRC);
}
$seq = "";
open(INFILE, "$fileName"); # open a file for reading
$description = <INFILE>; # retrieve the fasta description
chomp($description); # remove the "\n"
while($line = <INFILE>) {
chomp($line); # remove the "\n"
$seq .= $line;
}
close(INFILE);
$newSeq = reverseComplement($seq); # CALL THE SUBROUTINE
$newDesc = $description . " -- REVERSE COMPLEMENT ";
$newFileName = ">" . $fileName . ".RC";
open(OUTFILE, $newFileName);
# open a file for writing
print OUTFILE ("$newDesc\n"); # print the seq descriptor
$len = length($newSeq);
# This time, we will write 60 characters per line
$numLines = int($len / 60); # number of full lines
if(($len % 60) != 0) { $numLines++; }
for($i = 0; $i < $numLines; $i++) {
print OUTFILE
substr($newSeq, $i * 60, 60), "\n";
}
close(OUTFILE);
# close the new data file
Regular Expressions
■
Regular
expressions are what make Perl so attractive to the computational biology
community. Regular expressions can be
used to find patterns in strings, such as looking for specific promoters or
start codons in a DNA sequence.
■
A regular expression
is a pattern or template to be matched against a string.
■
To relate a
regular expression to a string, the pattern binding operator(=~)
is used.
Matching (m//)
■
The matching
operator (m// or just //) is used to find patterns in a string.
■
For example, we
want to test if a string contains the sequence ATG:
$dnaStr = 'TTCGATGCCAC';
if($dnaStr =~ /ATG/) {
print
("ATG found.\n");
}
else
{
print
("ATG not found.\n");
}
This program displays:
ATG found.
■
The case of the
characters in a string can be ignored by appending an i to the matching
operator.
/bbb/ #
will not match "AAA BBB CCC"
/bbb/i #
will match "AAA BBB CCC"
Substitutions (s///)
■
The substitution
operator (s///) is used to change strings. To use it, put the old string between the
first and second / and the new string you want to change it to between the
second and third /.
■
For example, we
want to change ATG to TGA:
$str =
'TTCGATGCCAC';
$str =~
s/ATG/TGA/;
print
"$str\n";
This program displays:
TTCGTGACCAC
■
The program above
only changes the first occurrence of ATG to TGA. If you think there might be more than one ATG
and want to change all of them, you can append a g to the substitution
operator.
$str =~ s/ATG/TGA/g;
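The substitution operator also returns the number of replacements it made, which can be handy for counting. A small sketch with a made-up sequence:

```perl
$str = 'ATGCCATGTTATG';
$count = ($str =~ s/ATG/TGA/g);        # replace every ATG and count the replacements
print "$count replacements: $str\n";   # prints: 3 replacements: TGACCTGATTTGA
```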
Translations (tr///)
The translation operator(tr///)
is used to change individual characters.
(Note: substitution is for changing strings.) To use it, you put the old character(s)
between the first and second / and the new character you want to change them to
between the second and the third /. The
translation operator will translate all occurrences of the old character(s) to
the new character.
For example, if we want to
change all of the T's in a string to U:
$str = 'TTCGATGCCAC';
$str =~ tr/T/U/;
print
"$str\n";
This program displays:
UUCGAUGCCAC
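You can translate multiple
characters at a time. If the list of new characters is shorter than the list of old characters, its last character is reused.
$str =~ tr/AG/P/; # all A's and G's in $str are P now
If more than one
replacement character is given, the old and new characters are paired up by position.
$str =~ tr/AG/PL/; # all A's are P and all G's are L now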
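Because tr/// pairs old and new characters by position, it gives a compact way to complement a DNA sequence. The sketch below is one possible approach, not the method used elsewhere in this tutorial. It also relies on two standard behaviors not covered above: tr/// returns the number of characters it matched, and reverse() reverses a string when its result is assigned to a scalar.

```perl
$seq = 'AGCTAATT';
$gcCount = ($seq =~ tr/GC//);   # counts the G's and C's; with an empty
                                # replacement list, $seq is left unchanged
$seqRC = reverse($seq);         # assigned to a scalar, so the string is reversed: TTAATCGA
$seqRC =~ tr/ACGT/TGCA/;        # A->T, C->G, G->C, T->A gives AATTAGCT
print "$gcCount $seqRC\n";      # prints: 2 AATTAGCT
```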
How to Create More Complicated Patterns
■
There are some
other more complicated patterns that we may want to find. For instance, in computational biology, we
may be interested in locating the coding sequence in an mRNA. In such a case, we will be looking for a
pattern that begins with ATG (the start codon) and ends with one of three stop
codons (TAA, TAG, or TGA). It may also
be desired to separate the alphabetic from numeric characters in a string
sequence.
Single-Character Patterns:
■
The dot
"." will match any character except the newline (\n);
■
Example: /a.b/ matches "aab", "abb", "acb", …
Character class
●
A character class
is a list of characters between a pair of square brackets ([]). It matches exactly one character in the string, which must be one of
the characters in the list.
●
Example: /[abcd]/ matches a, b, c or d, but not e.
[0123456789] # match any single digit
[0-9] # match any single digit
[a-z] # match any single lowercase letter
[a-zA-Z0-9] # match any single letter or digit
Negated Character Class
●
A negated character
class is similar to a character class, but has a leading "^"
immediately after the left bracket. This
character class matches any single character that is not in the list.
●
Example:
[^0-9] # match any single non-digit
[^aeiouAEIOU] # match any single character that is not a vowel
# (consonants, but also digits, spaces, and punctuation)
Predefined Character Classes
\d a digit same as [0-9]
\D non-digit same as [^0-9]
\w word character same as [a-zA-Z0-9_]
\W non-word character same as [^a-zA-Z0-9_]
\s white space character same as [ \r\t\n\f]
\S non space character same as [^ \r\t\n\f]
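As a sketch of separating the alphabetic from the numeric characters in a string, the predefined classes can be combined with the substitution operator. The label 'exon12' is a made-up example.

```perl
$label = 'exon12';
$letters = $label;
$letters =~ s/\d//g;            # delete every digit, leaving: exon
$digits = $label;
$digits =~ s/\D//g;             # delete every non-digit, leaving: 12
print "$letters $digits\n";     # prints: exon 12
```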
Special Multipliers
●
Special
multipliers indicate how many times the character to its left should be
matched.
* 0 or more times
+ 1 or more times
? 0 or 1 time
●
Example:
/A+CGC?A/ # match one or more A's followed by CG,
# followed by an optional C followed by an A
General Multipliers
●
General
multipliers indicate how many times the character to its left should be
matched.
●
Examples
/A{3}/ # match exactly 3 A's
/A{3,}/ # match 3 or more A's
/A{3,8}/ # match 3 to 8 A's
●
The transcription
factor binding site for the SSP protein is:
GGCGGCGGCTGGCTAGGG
●
Using general
multipliers, we can create a regular expression for this pattern as follows:
/(GGC){3}TG{2}CTAG{3}/
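A multiplier is written after the character or parenthesized group that it repeats, so three copies of GGC are written (GGC){3}. The sketch below checks such a pattern against the binding site and against a made-up sequence with only two GGC repeats.

```perl
$site     = 'GGCGGCGGCTGGCTAGGG';
$tooShort = 'GGCGGCTGGCTAGGG';       # only two GGC repeats (made-up example)

if ($site =~ /(GGC){3}TG{2}CTAG{3}/)     { $first = "match"; }
else                                     { $first = "no match"; }
if ($tooShort =~ /(GGC){3}TG{2}CTAG{3}/) { $second = "match"; }
else                                     { $second = "no match"; }
print "$first, $second\n";           # prints: match, no match
```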
Alternation
●
Alternation
allows you to match one out of several different alternatives. The alternatives are separated by a vertical
bar (|).
●
Examples:
/song|blue/ # match either 'song' or 'blue'
/a|b|c/ # match a, b or c. (same as /[abc]/)
●
The GATA-1
transcription factor binding site is defined by a T or an A, followed by GATA,
followed by an A or a G. Using
alternation, we can create a regular expression for this as follows:
/(T|A)GATA(A|G)/
Anchoring patterns
●
^ matches the
beginning of a string, while $ matches the end of a string.
●
Examples
/^this/ #
matches 'this one' but not 'watch this'
/this$/ # matches 'watch this' but not 'this one'
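Anchors and alternation can be combined to sketch the coding-sequence check mentioned earlier: a string that begins with ATG and ends with one of the three stop codons. The test sequences are made up, and this simple pattern does not check that the stop codon is in the same reading frame as the start codon.

```perl
@results = ();
foreach $mRNA ('ATGGCCTTTTAA', 'GCCATGTTTTAA', 'ATGGCCTTTTAC') {
    if ($mRNA =~ /^ATG.*(TAA|TAG|TGA)$/) { push(@results, "$mRNA: yes"); }
    else                                 { push(@results, "$mRNA: no"); }
}
print join("\n", @results), "\n";
# prints:
# ATGGCCTTTTAA: yes
# GCCATGTTTTAA: no
# ATGGCCTTTTAC: no
```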
Pattern Memory
●
So now that you
know how to match characters, you need a way to find out what was matched by
storing or saving the matching portions.
●
Putting a pair of
parentheses around any pattern will allow the part of the string matched by the
pattern to be remembered and stored into a special variable called $1. If there are multiple pairs of parentheses, their matches are
stored in $2, $3, and so on, numbered by the position of each opening parenthesis.
●
Example:
●
The following
program looks for the GATA-1 binding site, stores it in $1, and prints it out.
$seq =
"AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
$seq =~
/((T|A)GATA(A|G))/;
print
"$1\n";
This
program will print TGATAA
●
Perl also has a
few special variables to help you know what matched and what did not.
$& the part of the string that actually
matched the pattern.
$` everything before the match
$' everything after the match
●
Example:
$seq =
"AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
$seq =~
/(T|A)GATA(A|G)/;
print
"$`\n";
print
"$&\n";
print
"$'\n";
This
program will display:
AAAGAGAGGGATAGAATAGAGA
TGATAA
GAAA
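A single match only reports the first binding site. To sketch how every site could be found, the match can be repeated in a while loop with the g modifier; each pass continues from where the previous match ended. The pos() function, which gives the offset just past the last match, is standard Perl but is not covered above, and the six bases appended to the sequence are made up so that there is a second site to find.

```perl
$seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA" . "AGATAG";   # second part added for illustration
@hits = ();
while ($seq =~ /((T|A)GATA(A|G))/g) {
    $start = pos($seq) - length($1);    # offset where this match begins
    push(@hits, "$1 at offset $start");
}
print join("\n", @hits), "\n";
# prints:
# TGATAA at offset 22
# AGATAG at offset 32
```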