What Is Bioinformatics?: 2015

Wednesday, December 30, 2015

End of Year Review (Part I)

My PI has requested that we answer a few questions as a reflection of the past year and as a mental preparation for the coming year. I figured it would be worthwhile to record at least a few of my answers here. This seems like a useful exercise and my answers are decently in depth. So the post isn't overwhelming, I'm going to dedicate a different post for each question.

What are the goals and milestones you met in 2015?

I have a vivid memory from Christmas break last year of asking a friend for advice on how to tackle a dataset that was larger than I knew how to handle. I had been doing scientific programming for about a month, and was worried that I was going to have to pick up C instead of continuing with Python. My code was slow and it seemed like weird things were happening with my memory usage. Not only did I have no clue what I was doing, but I also didn’t have any idea how I was supposed to go about learning what to do.

Part of the reason I remember the conversation is because the dataset I was struggling with was the same one that I use almost everyday now. Today, thinking back to my coding style a year ago makes me want to shake my head for a variety of reasons. I took the time to find some code from last year; it was as bad as I remembered. Here’s a small list of problems (in no particular order) with the code:

Most of the scripts have virtually no comments or documentation.
Everything is hard-coded. Occasionally, there are chunks of data stored in the script. There are random strings referring to data files throughout the code. There’s no way I would reliably have found and changed all those strings if I wanted to use different data.
There’s only minimal use of libraries like numpy and scipy; I didn’t feel very comfortable with them yet.
Several scripts have no functions whatsoever, just 100 lines of straight code. All of these scripts were obviously capable of doing exactly one thing.
Other scripts have one main function that take a couple basic parameters (like the names of files). Unfortunately, I didn’t yet know how to incorporate command-line options into a python script. At the bottom of the script there’s a long list of strings corresponding to the files on which I had run the code. As I performed different runs of the script, I would comment out previous strings…

These coding problems had very real effects on my ability to do computational research. It was constantly difficult for me to remember which script I needed to bring up to run a particular analysis. Once I found the code, I often forgot to switch out one or more of the strings in my code, which meant it ran incorrectly and I had to rerun the code. And at the end of the week, I often didn’t remember how I managed to produce the data I had stored.

My primary goal for last year was to become comfortable with the Python scientific ecosystem. I’ve devoted a lot of time and effort to finding and exploring new computational tools in order to do so. While I’ve only explored a tiny fraction of all the available resources, I believe I achieved my goal insofar as I can now reliably reproduce my research and I know where to look when I need answers to a new question. Below is a list of tools that have become indispensable, and how they have allowed me to reach specific computational milestones:

Surprisingly, https://www.reddit.com/r/Python has been a great resource. The community is active and new packages, updates, and other resources are constantly posted and discussed. I now have a way to stay up to date with virtually everything Python has to offer as a data analysis tool.
Numpy/Scipy/Pandas/Matplotlib/sklearn are now part of my everyday tools. I understand their APIs, documentation, and am generally familiar with their capabilities and limitations. This means that I can much more quickly import, manipulate, analyze, plot, and save large datasets than I could last year. A great example is using Numpy and linear algebra for computing Pearson’s correlation ~500 times faster than I was previously.
Discovering the Ipython/Jupyter Notebook has been the biggest game changer for me. Using the notebook allows me to essentially record everything I do in a month. A lot of code that I write has a one-and-done functionality. There’s no reason for me to store it as its own script. Now I don’t have to. I can make notes to myself in the same location as my code, save graphs directly below the code that made them, and section off my projects in a way that makes sense. Basically, I now have a method for reproducibility that has stopped my code directory structure from becoming an incomprehensible disaster. Organization is not my natural forte, so having the capability to use the Notebooks feels like a major accomplishment.
Another large benefit of the Notebook is that the .py files I do write are readable and reusable. When I write some code I know I’m going to need again, I take a bit of time and rewrite it in a .py file as a class. I properly break the code up into modular functions and remember to write a docstring for each function. Inevitably, when I go to rerun my code, I’m going to want to try something slightly different than I did the first time. Now, all of my core logic is abstracted away in a .py file. All I have to do is instantiate the class I want inside of the Notebook and choose the functions I want to use this time. All of my slight variations from run to run are immediately saved in the notebook.

While I’m still a long way away from a data analysis expert, I can honestly say that I no longer feel on the brink of disaster like I did this time last year. Improvements can be made in each of the four points listed, but I’m happy to say that I believe I’m on the right track.

Monday, December 28, 2015

Teach Your Fourth Grader Python (Make a Game)

Welcome to the second post on introducing your child to Python! You should have Anaconda (or another version of Python) installed on your computer before continuing. As I mentioned before, I wrote this code for and with a class of 4th graders. Did all of them understand all of the concepts? No, of course not; that wasn't really the point. All of the students thought that the game was fun. All of the students gained a little bit of insight into the technology they use everyday. That's the point.

Similarly, I'm not going to attempt to fully describe everything that's going on in the code here either. The hope is that I can given enough of a description that you'll be able to either guess or google how to change a line if you wanted to do something similar. What if you decided you wanted to do division problems instead? What if you wanted to add a third player? Exercises like these are fun ways to familiarize yourself with what's going on in a script.

Remember, this is for Python 3 and won't work with Python 2. Leave a comment if you want the 2.7 version for any reason. Here is the code in full:

import time
import random

#Introduction
print("Hello, Players!")
print("Welcome to Multiplication Battles!")
print("")

#Input information
player1 = input("What is your name, Player 1? ")
player1_points = 0
player2 = input("What is your name, Player2? ")
player2_points = 0 
length_per_round = int(input("How many seconds do you want each round to be? "))
highest_num = int(input("What's the largest number you want to multiply? "))

#Launch Player1's game
print("")
input("{}! Hit Enter to start your time.".format(player1))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
    
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
    answer = int(input("What is {}x{}? ".format(a,b)))
  
    #Check if right or wrong
    if answer == c:
        print("You're Correct!")
        player1_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
  
    #Must update the time at the end of each loop.
    current_time = time.time()

print("You scored {} points!".format(player1_points))

#Get ready for Player2
print("")
input("{}! Hit Enter to start your time.".format(player2))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
  
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
  
    #Check if right or wrong
    answer = int(input("What is {}x{}? ".format(a,b)))
    if answer == c:
        print("You're Correct!")
        player2_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
    
    #Must update the time at the end of each loop.
    current_time = time.time()
print("You scored {} points!".format(player2_points))

#Decide a winner
print("")
if player1_points > player2_points:
  print("{} wins! Good job!".format(player1))
elif player2_points > player1_points:
  print("{} wins! Good job!".format(player2))
else:
  print("It was a tie! You two should play again!")

Now we'll go through a section by section breakdown. Any line that begins with a # is a comment, and doesn't effect the way the code runs. Think of them as notes to yourself that the computer ignores.

import time
import random

These two lines give us access to other pieces of code we'll use later in the script. This page lists all of the 'libraries' that come standard with Python.

#Introduction
print("Hello, Players!")
print("Welcome to Multiplication Battles!")
print("")print ""

This is the beginning of the game. The keyword 'print' is how text shows up in the console when running python. Notice that everything being printed in inside of quotation marks. This just indicates that we are printing some normal text, technically called a 'string'. Other things, like integers, for example, can also be printed.

#Input information
player1 = input("What is your name, Player 1? ")
player1_points = 0
player2 = input("What is your name, Player 2? ")
player2_points = 0 
length_per_round = int(input("How many seconds do you want each round to be? "))
highest_num = int(input("What's the largest number you want to multiply? "))

Before launching into questions, the game collects a little bit of information about who's playing. The first thing we do is create 'player1' by asking the game players to input the name of the first player. Whatever the player types in will be stored inside of the variable 'player1'. We need to keep track of how many points the first person has, so we'll create another variable called player1_points. Since the game hasn't started, the first player currently has 0 points. The reason that the words on the left hand side of the equals sign are called variables is because their values can change over the course of the program. When the first player correctly answers a question, they'll gain a point and we'll overwrite the score stored in 'player1_points'. We'll see how this works later. Variables are a pretty complicated concept for 4th graders. The idea was one of the biggest difficult while teaching this course. If anyone has difficulty understanding what's going on here, I highly recommend checking out 'How to Automate the Boring Stuff'.

We then repeat the process for the second player, and collect their information. Lastly, we need to know how long we want each round to last and what the largest number is that we feel comfortable multiplying. Maybe we want to go easy and play a game for 30 seconds only multiplying up to 10. Or we could go for something more challenging and give ourselves a minute to multiply numbers up to 20. Whatever you want to do. Notice the 'int' surrounding the 'input' on the last two lines. 'int' stands for 'integer', and we use this because we want Python to treat our entries in these lines like numbers instead of 'strings'.

#Launch Player1's game
print("")
input("{}! Hit Enter to start your time.".format(player1))
current_time = time.time()
end_time = current_time + length_per_round

Line number 3 is a nice way to pause the game until the first player is ready. But what's going on with the '{}'? It's just a placeholder in the text. When it is printed to the console, the first player's name will be displayed in the sentence. So, if I were playing, for example, the prompt would read, 'Jessime! Hit Enter to start your time.' As soon as I hit enter, the next line of code would execute.
The next line is where one of our imports come in handy. We're just going to access the computer's clock to store the current time. We need to do this because we want the player to answer questions for 30 seconds, or however long was chosen. So we need to know what time the player started. We then calculate when the player's time is up (by adding up our current time and how long we want to play), and store it in the 'end_time' variable.

while current_time <= end_time:
    
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
    answer = int(input("What is {}x{}? ".format(a,b)))

This is where the first player begins getting questions. The first line is what's known as a while loop. Loops are another tough subject; use the link and other resources on Google to get more information if necessary. What's happening here is quite straightforward though. We're telling Python 'Hey, execute this next bit of code over and over until the first player runs out of time'. The 'next bit of code' means all of the lines which start 4 spaces over. The code that we're going to execute over and over is:

Generate a question
Ask the player the question
Give the player a point if they answer correctly
Tell the player the correct answer if they miss the question
Update the time and see if the 30 seconds is up or not

These five steps will be repeated. Lines 4-7 in the code above knock out the first two steps. We create 'a' and 'b' by using the 'random' library to generate a random integer in the range of 0 to our highest number (12, for example). The correct solution to the question will be 'a' times 'b', which we store as 'c'. Then we ask the question to the player and store their guess as 'answer'.

        #Check if right or wrong
        if answer == c:
            print("You're Correct!")
            player1_points += 1
        else:
            print("Sorry. {}x{}={}.".format(a,b,c))

Note: This should be indented 4 spaces, but the syntax highlighter won't allow is. Look at the whole code or the player2 example for proper indentation.
At this point, one of two things can happen. Either the player is right, or they're wrong. Line two says, if the answer the player gave is equivalent to the correct answer, execute the next bit of code. Again, the 'next bit of code' means all of the lines that are indented 4 spaces over. So, if the player get's the question correct, we're going to do two things. We print to the console letting the player know they answered correctly. We then increment 'player1_points' by one. That is, if the question is the first correctly answered, 'player1_points' goes from 0 to 1. If the first player has 4 points already, then 'player1_points' will now be equal to 5. If the player did not correctly answer the question, we print the correct equation.

    #Must update the time at the end of each loop.
    current_time = time.time()

This line is absolutely key to the while loop. It's the last line of the while loop, which means, since we're in a loop, we're about to jump back up to line 22 (of the whole code). Once we are back at 22, we're going to reevaluate if we have any time left. The only way that the 'current_time' variable will be accurate is if we update it by 'current_time = time.time()'. If we don't have this line, the while condition will always be true, you'll enter an infinite loop, and you'll have to shutdown python manually. Make sure to have this line.

Once the current time is greater than the end time, we'll exit out of the while loop, and the first player's turn will be over.

print("You scored {} points!".format(player1_points))

We can let the first player know how many points they scored. It may not look like it from the line count, but at this point, we're pretty much finished. A majority of the rest of the code is a duplicate of what we just did, but for the second player.

#Get ready for Player2
print("")
input("{}! Hit Enter to start your time.".format(player2))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
  
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
  
    #Check if right or wrong
    answer = int(input("What is {}x{}? ".format(a,b)))
    if answer == c:
        print("You're Correct!")
        player2_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
    
    #Must update the time at the end of each loop.
    current_time = time.time()
print("You scored {} points!".format(player2_points))

Literally the only change here is that I've replaced 'player1' with 'player2'. Now that each player has had their turn, the only thing left to do is figure out who the winner is.

#Decide a winner
print("")
if player1_points > player2_points:
  print("{} wins! Good job!".format(player1))
elif player2_points > player1_points:
  print("{} wins! Good job!".format(player2))
else:
  print("It was a tie! You two should play again!")

This block of code should look very similar to the part where we decided if the players correctly answered the question or not. The only difference is that there are now three things that can happen:

Player 1 can win.
Player 2 can win.
The players can tie.

Only one of these three statements will print, and we make the decision by comparing 'player1_points' to 'player2_points' and responding appropriately.

And that's it! Once the scripts executes this last line, it will automatically end and the game is over.

Teach Your Fourth Grader Python (Installing Anaconda)

It's really never too early to start teaching your child to code. You may think, "What's the point of teaching my kid to code so long before they have any idea what they want to do with their life?" The topic is worth a post of its own, but here are a couple of articles by Beth Werrell and Dan Crow that address how useful "computational thinking" is in a world where everything relies on software. Another possible issue that might come to mind is, "How do I teach my child to code when I don't even know how?" Thankfully, there are already some really great resources out there to help you learn and teach at the same time. One of my favorite books is Invent Your Own Computer Games with Python which is free online. Another resource, which I haven't used at all but generally gets good reviews is Teach Your Kids to Code: A Parent-Friendly Guide to Python Programming.

I had the opportunity earlier this semester to team up with a friend of mine to each a few hours worth of Python to her 4th grade class. Let me point out that this isn't nearly enough time to learn a programming language. It's enough time to see what coding is and complete a small project to get the students interested in coding. In 4th grade kids are still learning/perfecting their multiplication tables, so we decided to make a multiplication 2-player game.

We'll go over two things. First, I'll show you how to easily install Python (it can be a little bit of a challenge on your own). Then, in the next post, I'll present the code that I wrote for class, and describe the basics of what it does. Again, this isn't going to provide enough information for you to "learn how to code". It's a fun first project to show you why learning to code is worthwhile. I would suggest using one of the resources I linked above to figure out all of the details about how the code works.

Installing Anaconda

Anaconda is a version of Python nicely bundled into a convenient package. Here are a list of steps to get up and running

1. Go to https://www.continuum.io/downloads to download the program.

2. You'll see options for Windows, OSX, and Linux. Choose the one that works for you.

3. You'll also have the option between Python 2.7 and 3.5. You should choose 3.5 unless you have a very specific reason for choosing 2.7.

4. Begin the download by clicking the 3.5 Graphical Installer I've highlighted in the red rectangle above.

5. Once the download is complete, open the installer.

6. Just follow the defaults and walk through the installer.

7. Agree to the Terms of Service.

8. Install for Just me.

9. Accept the default 'Destination Folder'

10. Click the 'Install' button.

11. Once the install as been completely finished, search for 'Spyder' in your computer's search. Look for an icon similar to the one below:

12. Note: Ignore that my icon has Python 2.7, it's what I use because of legacy issues.

13. Select the icon to launch Spyder. It will take a minute to launch.

14. Congratulations! You're ready to start coding.

In case you're wondering Spyder is what's known as an IDE. At its simplest, it is just a place to write and run your code. If yours doesn't look exactly the same as mine, don't worry. It's going to look different depending on whether you're using Windows, OSX or Linux.

Monday, October 5, 2015

Calculating a Pearson Correlation Matrix

One of the great things about graduate school is you get to constantly realize how ignorant you are about how things work. If the domain name weren't already taken, I would think about moving my stuff over to seriously.dontusethiscode.com. It can get a bit depressing thinking about all the stuff there is to know. But if you think about it a bit more, it's really just awesome how exciting our world is. I'm glad I've got this opportunity to learn and explore.

Anyway, I found a question on reddit that reminded me of a post I did a while ago. One of the key points of 'Simple Sequence Similarity' was calculating a Pearson correlation matrix. While I realize that it isn't exactly the same as the problem on the reddit link, the nested iteration made me realize that I should probably show some improved code that I currently use to calculate the correlation matrix. By "improved", I mean ~500x faster.

Here's a link to the notebook.

As a fun bonus (which I won't dive into right now) there are at least four different examples of Ipython magic throughout the notebook. As always, I welcome any comments or suggestions!

Cheers!

Friday, August 28, 2015

Bot Battle

For those of you who don't program, it may be hard to understand just how slow a process it usually is. Movies and other media don't help this misunderstanding. As Mark Zuckerberg put it, "If they were really making a movie, it would have been of me, sitting at a computer coding for two hours straight". Similarly, here's a visual representation of the difference.

Every once is a while though, it's actually useful to be able to throw together a quick script, or to generate a plot while you're having a discussion with your colleague. It's not something I'm particularly skilled at since I don't do it much. So is there some fun way to practice coding quickly?

Let's make a bot battle competition!

I wanted this to be a pretty straightforward weekend project, so I decided to stick with a simple game that I love a lot. Reversi, also known as Othello is a great game that even little kids can easily understand. The game itself hasn't been solved though, so it's far from trivial. It's the perfect game for this sort of project.

I'm not going to run through all of the code or anything, but the repo is available if you want to try it out yourself. A more detailed explanation of how to run the code is there, as well as the general rules of the game. If you don't feel like playing yourself, here's a quick clip of what a Series might look like:

If you have any suggestions or need help setting up the game, just let me know!

Cheers!

Friday, August 7, 2015

The Perceptron

Today in lab meeting, during our bi-weekly "Idea Potluck", I dropped the ball while explaining how perceptrons work. We were running short on time, so I rushed and straight-up got flustered. It was unfortunate, since perceptrons are really cool, and Wikipedia has a great example (here).

To over compensate, I've written a tutorial to go along with the Wikipedia example.

http://nbviewer.ipython.org/gist/Jessime/62c2a8a0baa2c0d061b8

The example and the code are really straightforward, so check it out!

Cheers!

Monday, August 3, 2015

Bioinformatics Software

This isn't a long post, but I read something interesting the other day and wanted to repost it. First, here's the link to the blog:

http://www.bioinformaticszen.com/

But since the post is so short, here it is in it's entirety:

"Every piece of software written for publication, which then can't be found after publication is grant funding thrown away. Every piece of software that only worked for the author during manuscript preparation is grant money thrown away. Every piece of software reinvented solely for the purpose of adding a new feature and publishing is grant money thrown away.

Grants are harder and harder to obtain, yet we fund the current attrition of moving bioinformatics software forward one reinvention at a time. Where else is it acceptable to reinvent a tool rather than try to improve upon an existing one? Count how many types of webservers are commonly used across the web, then count how many different genome assemblers have been published. If we were a company and we behaved this way, we would have gone out of business long ago. We have accepted a state whereby the short-term reward of publication trumps the longer-term goal of maintaining and improving existing open-source software. We now reap the rewards of this every time we can't find stable software to run our analyses."

I think anyone who's tried to learn bioinformatics instinctively knows this is true. Instead of whining about just how true I've found it to be in the past year, I'll offer two high-level solutions. By "high-level", I realize I ignore all of the depressing, nitty-gritty details.

1. Reproducing computational results must be encouraged and rewarded. Given the current research environment, the rate of independent software validation is unlikely to increase. We as a community must find methods to foster reproducibility; perhaps in ways as simple as offering "bounties" to teams or individuals who validate newly published software.

2. Guidelines for software publication should become as strict manuscript guidelines. Here's an outline of what's needed to publish to Science:
http://www.sciencemag.org/site/feature/contribinfo/prep/prep_init.xhtml
Where's the equivalent document for software? "Software" isn't even mentioned at all in the guidelines.

Monday, July 13, 2015

Excision Walkthrough

I take things too seriously sometimes. As far as I'm aware, I've always been like that. When I was a kid, I knew, deep down, that walkthroughs were bad. Maybe they weren't quite morally wrong, but they were certainly pushing it. You can probably ask my brother how I felt about walkthroughs. He was often on the receiving end of my scorn when he used a website to figure out a particularly difficult puzzle. "That's not the way you're SUPPOSED to do it!" Children can be so vicious. Little did I know I would eventually be publishing my own walkthrough.

Hopefully I have done my fair share of maturing. Games, I realize almost two decades later than everyone else, are intended to be fun by definition. If utilizing a guide makes the game more entertaining for you, then by all means, use it. Enjoy the little things in life the way you like them, not the way that you're "supposed" to, or the way that others tell you you should. It's better to learn these things late than never, right?

I've had a tremendous amount of fun putting this game together. That being said, I am still taking things very seriously, and have also viewed this as a learning experience. I've never tried to market anything before. To put it lightly, I failed terribly. Most of the code that I wrote for this game is rather different than the style of code I write for work. Anyone with any web development experience would have a few things to say about my email script.

"Whatever. This is my first time, and my next one will be better." Tell yourself that and go try something new. Learn something difficult.

Needless to say, the following link contains spoilers. It also contains a pretty awesome story. While I fully encourage the use of this document, only use it if you know that's what you want to do.

[This link has been temporarily removed]

Until next time,
Cheers!

Wednesday, July 8, 2015

Excision Winner Announcement

Congratulations to Ankur Sundara for being the first person to complete Excision! For his dedication and problem-solving abilities, he's won the first place price of $50! Ankur, a New Hampshire resident, is a rising senior in high school. Ankur had this to say about the game: "Excision was an awesome competition! It was a fun mix of knowledge of biology, computer security, and programming skills. My prior participation in hacking/programming competitions definitely helped me out a lot. Some of the challenges had me stumped for a while, but they all had pretty elegant solutions. I hope to see a bigger, better sequel to this competition in the future!" I want to thank everyone who participated in the game or even gave it a glance. Also, for those of you who still want to play, there's an awesome second place price: a sense of adventure and the feeling of a job well done! Until next time!

Saturday, February 21, 2015

Base Camp

Welcome to Excision.

Before we set off on our adventure, I think it's only fair to give you some provisions for your journey. They come mostly in the form of the basic philosophy of Excision. You're going to be playing the role of an Agent who runs into some unique circumstances which are going to require some equally unique talents. But I'm sure you're up for the job.

The basic objective of Excision is to complete a number of stages. There will also be an accompanying story, but I'll explain up front that the basic template for each stage is (with some variations):
1. Read the story. It provides important context for the stage, and it’s entertaining!
2. Work through the puzzle at that stage.
3. Uncover the password necessary to unlock the next stage/clue.
4. Find the location of the next stage/clue.
5. Repeat.

The game itself has a few main themes, which closely echo the themes of this blog. Computers and programming are really exciting subjects, not to mention increasingly ubiquitous in our everyday lives. Unfortunately, they can seem intimidating to learn. I'm hoping these stages can walk you through a few of the basics. Of course, since this is a bioinformatics blog, there's an emphasis on computing in biology, but the skills necessary to complete the stages are useful in any field. There is almost always more than one way to get to the next stage. Think creatively. Think logically. Talk to other people; sometimes people think about things in different ways.

Of course, what's a game without a prize? The first person to complete the final stage of the puzzle will be the winner of $50! The $50 can be sent directly to you (cash or check) or donated in your name to your favorite charity.

Additionally, here's a brief reminder that, while anyone can play Excision, to be eligible for the prize you have to like the new Facebook page for the blog: www.facebook.com/WhatIsBioinformatics

I'm trying to separate my personal Facebook account from the blog, and this is a good way of doing it. More importantly for the game, the Facebook page is the platform of the referral/reward system.
The Facebook page is how you can earn personalized hints in order to complete Excision more quickly. Games are always more entertaining when played with friends, so invite yours to like the page and help you solve the challenges. Below are the specifics of how to earn and spend hints:

• If someone mentions your name in a comment or a post on the Facebook page, you earn 1 hint.
• If you share the Facebook page on your Facebook account, you earn 2 hints.
• Special “Bonus Hints” times will give Agents unique ways to earn extra hints.
• You can spend hints by sending a private message to the Facebook page.
• In the message, describe how you earned your hint (Comment/Share/Bonus Hints).
• If you earned your hint via a comment, include the name of the person who mentioned you.
• Include whatever question you want to have answered.
• While responses may not be immediate, one of our team members will get back to you within 24 hours.
• Questions asked can be retracted, for whatever reason, at any point before a clue has been given by sending the message “Please ignore the previous question”.
• Retracted questions will not cost a hint.
• Hints can be stockpiled indefinitely and do not need to be spent any given time, even during Bonus Hints.

A copy of the official rules for the Excision contest may be found at http://tinyurl.com/oq3aw4r. Continuing to the first level from this blog post implies that you have read and accepted the rules and that you know what it takes to be eligible for the prize.

Note that this is Base Camp. A few of the stages are located here, though they're encrypted and password-protected. When you’re not sure where to go next, always try returning here.

A final tip: a good Agent, like a good scientist, always cHecks their source, not jusT the source of the inforMation he/she is currently looking at but also aLl the way back to the beginning.

Have fun and enjoy Excision!

Who was it?

What COUNTRY *space* STREET ?

What is he?

Where? xx.xxxxxx *space* xx.xxxxxx

Sunday, February 15, 2015

The Protein Folding Answer

I need to issue a disclaimer before anything else. The script below is most definitely NOT the answer to protein folding. In fact, I make no claims at all about the efficacy of this code beyond it's rudimentary ability to fold the beta-hairpin shown here. This code is just a homework assignment and learning experience for me. I believe the code is decently well commented for anyone who's working with PyRosetta themselves. MAYBE this can be used at a starting point.

I'm also not going to go through the code section by section as I did for the sequence similarity code, but here's a brief discussion of what the script is supposed to accomplish:

1. First the peptide has to be generated. It's what we are going to fold in this program.
2. Since we know we're making a beta-hairpin, we use knowledge based design to set the basic starting point angles of the backbone.
3. Then we go through a million or so cycles of what amounts to three steps:
i. Randomly shift the peptide by some amount.
ii. Calculate the energy score of the new position compared to the old one. (Since we know we're looking for hydrogen bonds, we add an extra bonus for moves that potentially form h-bonds)
iii. Decide whether to accept the new position of the peptide, or to reject it and keep the old position. This decision is based on the Metropolis Criteria.
4. As the folding simulation progresses we periodically output the energy score to see how things are going.
5. At the end of the run, we save the new peptide so we can view it later.

In a way, the steps above are deceptively simply. There are only a few processes and they're fairly straightforward. On the other hand, the theory underlying these steps are very powerful. More importantly, the amount of refinement that can be build on top of these steps is enormous.

At the beginning of the simulation, the peptide is just a straight chain:

By the end, the peptide has folded and formed hydrogen bond. The cartoon representation of the peptide shows that it does appear to be a beta-hairpin.

Below is dumped the code used to make this transformation. If you have questions about any of the specifics, I'd be happy to try to answer them. For those of you looking for the actual answer to the protein folding problem sorry for the disappointment.

Maybe next time. Cheers!!

"""Import necessary libraries and initialize rosetta"""
import random
import math
from rosetta import *
from rosetta.PyMolLink import *

init()  

"""Generate the beta-hairpin pose and score it"""
strand = "AAAAA" #alanine end residues
turn = "GG" #glycine turn residues
hairpin = strand+turn+strand #combine to make hairpin
pose = pose_from_sequence(hairpin,"fa_standard") #Create Rosetta pose
fa = get_fa_scorefxn() #Initialize a scoring function
fa(pose) #Score the hairpin
pose.dump_pdb("start1.pdb") #Generate an image of the starting structure
pose.set_psi(5,90) #Set starting angle
pose.set_phi(6,60) #Set starting angle
pose.set_psi(6,-120) #Set starting angle
pose.set_phi(7,-80) #Set starting angle
pose.set_psi(7,0) #Set starting angle
pose.set_phi(8,-120) #Set starting angle


"""Initialize variables"""
old_pose = Pose() #Initialize empty poses for later
low_pose = Pose() #Initialize empty poses for later
max_temp = 100 #Initial temperature for metropolis and annealing 
current_temp = max_temp #Current temperature will drop through cycles
max_cycles = 1000000 #How many moves you want to make 
low_energy = 10000 #Will store the value of lowest energy conformation
constraint_weight = 0.5 #Parameter for how much h-bond distances matter
ideal_dist = 0.29 #The proper distance for an h-bond between the alanines


"""Begin iterating through moves"""
for i in range(0,max_cycles): #Loop through the all the cycles


    """Update accept flag, and the old pose/energy/score"""
    old_pose.assign(pose) #Set the old_pose
    fa(old_pose) #Score the old pose
    old_energy = fa.score(old_pose) #Define variable for the energy of old pose
    old_score = old_energy #Alias old_energy, to use old_score for hbond constraining

    
    """Move given angle or angles by a chosen amount."""
    current_residue = random.randint(1,pose.total_residue()) #Choose the residue to move from a uniform distribution
    curr_phi = pose.phi(current_residue) #Set the current phi angle to be moved
    curr_psi = pose.psi(current_residue) #Set the current psi angle to be moved
    angle_shift = random.gauss(0,25*(float(i)/float(max_cycles))) #As the folding progresses, move the angle(s) by a smaller amount
    if(random.random()<(float(i)/float(max_cycles))): #As the folding progresses, make the probability of a shear move increase
        pose.set_phi(current_residue,curr_phi+angle_shift) #Set the new phi angle (shear move)
        pose.set_psi(current_residue,curr_psi-angle_shift) #Set the new psi angle (shear move)
    elif(random.random()<0 data-blogger-escaped--0.01="" data-blogger-escaped--="" data-blogger-escaped-.5="" data-blogger-escaped-0:="" data-blogger-escaped-0="" data-blogger-escaped-1000="" data-blogger-escaped-3="" data-blogger-escaped-a="" data-blogger-escaped-above="" data-blogger-escaped-accept:="" data-blogger-escaped-accept="" data-blogger-escaped-accepted="" data-blogger-escaped-accepting="" data-blogger-escaped-actions="" data-blogger-escaped-addition="" data-blogger-escaped-after="" data-blogger-escaped-alculate="" data-blogger-escaped-and="" data-blogger-escaped-angle="" data-blogger-escaped-angle_shift="" data-blogger-escaped-appropriate="" data-blogger-escaped-are="" data-blogger-escaped-as="" data-blogger-escaped-atom="" data-blogger-escaped-atoms="" data-blogger-escaped-ave="" data-blogger-escaped-bad="" data-blogger-escaped-between="" data-blogger-escaped-bonding="" data-blogger-escaped-change="" data-blogger-escaped-chosen="" data-blogger-escaped-code="" data-blogger-escaped-constraining="" data-blogger-escaped-constraint_weight="" data-blogger-escaped-core="" data-blogger-escaped-criteria="" data-blogger-escaped-cterm="pose.residue(12-2*k)" data-blogger-escaped-cterm_pos="Cterm.xyz(" data-blogger-escaped-curr_phi="" data-blogger-escaped-curr_psi="" data-blogger-escaped-current="" data-blogger-escaped-current_residue="" data-blogger-escaped-current_temp="max_temp*(math.exp(-1*max_cycles**-1))" data-blogger-escaped-delta_e="" data-blogger-escaped-dimentions="" data-blogger-escaped-dist="dist" data-blogger-escaped-distance="" data-blogger-escaped-each="" data-blogger-escaped-ecide="" data-blogger-escaped-edo="" data-blogger-escaped-efine="" data-blogger-escaped-else:="" data-blogger-escaped-end="" data-blogger-escaped-energy:="" data-blogger-escaped-energy="" data-blogger-escaped-epeat="" data-blogger-escaped-eset="" data-blogger-escaped-estore="" data-blogger-escaped-et="" data-blogger-escaped-f="" data-blogger-escaped-fa="" data-blogger-escaped-file="" data-blogger-escaped-final="" data-blogger-escaped-find="" data-blogger-escaped-first="" data-blogger-escaped-folding="" data-blogger-escaped-for="" data-blogger-escaped-hbond="" data-blogger-escaped-hbonding="" data-blogger-escaped-heck="" data-blogger-escaped-here="" data-blogger-escaped-i="" data-blogger-escaped-ideal_dist-old_dist="" data-blogger-escaped-if="" data-blogger-escaped-in="" data-blogger-escaped-inal="" data-blogger-escaped-ind="" data-blogger-escaped-is="" data-blogger-escaped-isn="" data-blogger-escaped-iterations="" data-blogger-escaped-j="" data-blogger-escaped-just="" data-blogger-escaped-k="" data-blogger-escaped-lias="" data-blogger-escaped-low_energy="" data-blogger-escaped-low_pose.assign="" data-blogger-escaped-low_pose.dump_pdb="" data-blogger-escaped-lower="" data-blogger-escaped-lowest="" data-blogger-escaped-math.fabs="" data-blogger-escaped-math.pow="" data-blogger-escaped-max_cycles="" data-blogger-escaped-metropolis="" data-blogger-escaped-move="" data-blogger-escaped-moves="" data-blogger-escaped-new="" data-blogger-escaped-new_dist="math.sqrt(dist)" data-blogger-escaped-new_energy="" data-blogger-escaped-new_score="new_score+math.fabs(ideal_dist-new_dist)*constraint_weight" data-blogger-escaped-not="" data-blogger-escaped-nterm="pose.residue(1+2*k)" data-blogger-escaped-nterm_pos="Nterm.xyz(" data-blogger-escaped-of="" data-blogger-escaped-old="" data-blogger-escaped-old_dist="math.sqrt(dist)" data-blogger-escaped-old_energy="" data-blogger-escaped-old_pose="" data-blogger-escaped-old_score="" data-blogger-escaped-one="" data-blogger-escaped-onte_carlo_2-blog2.pdb="" data-blogger-escaped-or="" data-blogger-escaped-other="" data-blogger-escaped-outcome.="" data-blogger-escaped-pair="" data-blogger-escaped-pdate="" data-blogger-escaped-pdb="" data-blogger-escaped-phi="" data-blogger-escaped-pose.="" data-blogger-escaped-pose.assign="" data-blogger-escaped-pose.set_phi="" data-blogger-escaped-pose.set_psi="" data-blogger-escaped-pose="" data-blogger-escaped-poses="" data-blogger-escaped-pre="" data-blogger-escaped-print="" data-blogger-escaped-probability="" data-blogger-escaped-process="" data-blogger-escaped-progress="" data-blogger-escaped-psi="" data-blogger-escaped-r="" data-blogger-escaped-range="" data-blogger-escaped-residue="" data-blogger-escaped-residues="" data-blogger-escaped-rint="" data-blogger-escaped-runs="" data-blogger-escaped-save="" data-blogger-escaped-score.="" data-blogger-escaped-score="" data-blogger-escaped-script="" data-blogger-escaped-second="" data-blogger-escaped-set="" data-blogger-escaped-shear="" data-blogger-escaped-so="" data-blogger-escaped-spacial="" data-blogger-escaped-statements="" data-blogger-escaped-str="" data-blogger-escaped-structure="" data-blogger-escaped-t="" data-blogger-escaped-take="" data-blogger-escaped-temperature="" data-blogger-escaped-term_pos="" data-blogger-escaped-than="" data-blogger-escaped-the="" data-blogger-escaped-there="" data-blogger-escaped-to="" data-blogger-escaped-until="" data-blogger-escaped-update="" data-blogger-escaped-use="" data-blogger-escaped-utoff="" data-blogger-escaped-value="" data-blogger-escaped-values="" data-blogger-escaped-variable="" data-blogger-escaped-very="" data-blogger-escaped-whether="" data-blogger-escaped-with="">

Friday, February 6, 2015

Encryptions

Do you know the answer? Do you know the question?

Thursday, February 5, 2015

The Protein Folding Problem

First, if you've never had the opportunity to visit xkcd, I highly suggest you do it. It's a wonderful website. My personal favorite is the What If? feature. Read one of them; you'll like it. The comic relevant to this post is:

If you're unfamiliar with the protein folding problem, let me briefly set the stage. Imagine a protein made of 100 amino acids linked together in a chain. The protein has to figure out where to put each of these amino acids relative to each other, in order to work properly. (As an aside, proteins which fail to fold properly have implications to many horrible diseases.) For simplicity, let's also image that each of the amino acids can take one of 2 positions. This implies that the protein can take 2¹⁰⁰ positions. This may or may not seem like a lot of choices to you. Try putting 2¹⁰⁰ into WolframAlpha. It works out to be 1267650600228229401496703205376 different positions, and only one of them is right.

To think about this number a different way, consider having 100 computers that will each check a trillion of these different possibilities a second. That's pretty good, right? Actually, it's not even close. It's going to take you approximately ~400,000,000 years to be able to look at all possibilities. It's absolutely impossible to fold a protein this way. It's not how nature does it, and it isn't how we attempt to do it either. Instead, the general hypothesis is that as the protein begins to fold into a favorable/proper state, it lowers it's energy, which basically crosses certain possibilities off the list without even having to check. The usual analogy is a folding funnel:

So, while all spaces on the funnel are technically possible, the protein is almost always just going to fall down the funnel. When scientist try to fold proteins, they use this funnel analogy as a method for only checking some of the possible conformations. But they check intelligently, mostly looking at the positions that have a high chance of being the natural shape of the protein.

The specifics of how to implement such a search on a computer can get as technical as you like. But a teaching tool called PyRosetta has been made that allows students like me (and you?) to get a pretty good idea of whats going on and what tools we have available to us. In my next post I'll show you a very simple script written for class that allows for the folding of a short peptide.

Until next time, cheers.

Thursday, January 15, 2015

Simple Sequence Similarity

I've mostly refrained from posting any of the code that I've been writing in the past months because the code seems scraped together in a "just get it done" method that wouldn't be particularly interesting or useful to anyone but me. The programs are simple scripts only applicable to the single task at hand. But I've finally gained enough confidence to show off a few lines that I've been using recently.

How similar is the phrase "Get it finished" to "Finished get it"? Or in protein terminology, how homologous is GETITFINISHED to FINISHEDGETIT? Many similarity calculations assume sequence linearity, and therefore would not detect much likeness between these two phrases/sequences. The script below is a very simple way of detecting similarity in these sort of "mixed up" strings.

import numpy as np
from Bio import SeqIO
from itertools import product
from scipy.stats.stats import pearsonr

def k_mers(data, outfile,k):
    def occurrences(s,k1):
        dic = {} #Initialize dictionary
        c = 0 #Initialize counter
        while c+k1 <= len(s): #Iterate through the sequence until the end
            k_mer = s[c:c+k1] #Kmer defined as a slice of sequence s and size k
            if k_mer not in dic: #If the kmer hasn't been seen before
                dic.update({k_mer:1}) #Add the kmer as a key to the dictionary. 
            elif k_mer in dic: #If the kmer has been seen before
                dic[k_mer] += 1 #Increment the kmer's value by one.
            c += 1 #Move over one letter in the sequence
        return dic #Return the full dictionary

    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Open the fasta file
    c = 0 #Initialize a counter to determine size of storage array
    for i in transcripts: #For each transcript
        c+=1 #Add 1 to the counter
    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Reopen the fasta file
    c_mat = np.zeros([c,4**k]) #Initialize array to store the counts   
    for i, rna in enumerate(transcripts): #For each transcript
        counts = occurrences(str(rna.seq),k) #Get the kmer counts for sequence       
        counts_ar = np.zeros([4**k]) #Initialize array to store all kmers
        for j, sub in enumerate(product("AGTC", repeat=k)): #Generate each kmer
            k_mer = "".join(sub) #Turn kmer list into a string
            if k_mer in counts: #If the kmer is in the transcript
                counts_ar[j] = counts[k_mer] #Add the proper kmer value
        c_mat[i] = counts_ar #Add new row of counts to count matrix
    np.savetxt(open(outfile, "w"),c_mat, delimiter=",") #Save the matrix
   
def k_mers_corr(c_mat, outfile):        
    c_mat = np.genfromtxt(open(c_mat,"r"), delimiter=",") #Open the count matrix
    corr_mat = np.zeros([c_mat.shape[0],c_mat.shape[0]]) #Initialize new matrix
    for i, row1 in enumerate(c_mat): #For each row in count matrix
        for j, row2 in enumerate(c_mat): #For each row in count matrix
            dist = pearsonr(row1,row2)[0] #Get the corr coeff between rows
            corr_mat[j][i] = dist #Add coeff to matrix
    np.savetxt(open(outfile, "w"),corr_mat, delimiter=",") #Save the matrix
    
p = "A:/Library2/Data/" #My directory containing all my working files
i = p+"gencode_lncRNA_overlap_m.txt" #My transcript file
o1 = p+"blogger1.csv" #Output file for the count data
o2 = p+"blogger2.csv" #Output file for the correlation data
k_mers(i,o1,4) #Call to first function
k_mers_corr(o1,o2) #Call to second function

Even with the comments, this is a lot to digest at once, so let's go through it section by section.

The first section is pretty straightforward. It simply give Python access to the necessary tools that I'm going to need later on in the script.

import numpy as np
from Bio import SeqIO
from itertools import product
from scipy.stats.stats import pearsonr

The next section defines a couple of functions. "k_mers" takes the name of a data file (currently this must be a FASTA file, but it would be easy to generalize the function to be able to read other strings as well), the name of the file to which you would like to store your data, and the size of the substring/kmer you want to analyze. It will then produce a file telling you how many times each kmer appeared in each of the the sequences you want to compare. Notice that the "occurrences" function is defined exclusively inside of the "k_mers" function. It is the function that actually does the counting. If you know much of anything about Python, you probably want to point out that Python has a built in count method. The count method Python provides, however, does not count overlapping sub-strings. For example if you used the default count method on "AGTC" with a size of 2, it would return "AG" and "TC". "occurrences", on the other hand, will return "AG", "GT", and "TC"'.

def k_mers(data, outfile,k):
    def occurrences(s,k1):
        dic = {} #Initialize dictionary
        c = 0 #Initialize counter
        while c+k1 <= len(s): #Iterate through the sequence until the end
            k_mer = s[c:c+k1] #Kmer defined as a slice of sequence s and size k
            if k_mer not in dic: #If the kmer hasn't been seen before
                dic.update({k_mer:1}) #Add the kmer as a key to the dictionary. 
            elif k_mer in dic: #If the kmer has been seen before
                dic[k_mer] += 1 #Increment the kmer's value by one.
            c += 1 #Move over one letter in the sequence
        return dic #Return the full dictionary

Next, we need to know how many transcripts we are going to analyze. So, we open the file and read through it with a counter that adds one each time a new transcript is encountered. I'm not going to get into generators here, but Python needs to reopen a file to be able to read through it again, so I have to recall the same line.

    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Open the fasta file
    c = 0 #Initialize a counter to determine size of storage array
    for i in transcripts: #For each transcript
        c+=1 #Add 1 to the counter
    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Reopen the fasta file

Now we can make an 2-by-2 array that's eventually going to be filled with the number of kmers that are present in each transcript and begin cycling through them. For each transcript, we call the "occurrences" function we defined above. We send "occurrences" the current transcript we are working on as well as the size of the kmers we are looking for. It will return with a dictionary of the kmers and their corresponding counts.

    c_mat = np.zeros([c,4**k]) #Initialize array to store the counts   
    for i, rna in enumerate(transcripts): #For each transcript
        counts = occurrences(str(rna.seq),k) #Get the kmer counts for sequence

The next few lines might make more sense if you've looked ahead to what sort of statistical method we are eventually going to use for comparison of the sequences. But consider the trivial case where we want to compare "AGTA" to "CAAA" using a kmer size of 1. The first transcript is going to produce "A"=2, "G"=1, and "T"=1 while the second will be "A"=4 and "C"=1. But what is needed for "an apples-to-apples" comparison is the first "A"=2, "G"=1, "T"=1, "C"=0 while the second will now be "A"=3, "G"=0, "T"=0, and "C"=1. These next lines are producing those necessary placeholders.

        for j, sub in enumerate(product("AGTC", repeat=k)): #Generate each kmer
            k_mer = "".join(sub) #Turn kmer list into a string
            if k_mer in counts: #If the kmer is in the transcript
                counts_ar[j] = counts[k_mer] #Add the proper kmer value

The last two lines of this function are pretty self explanatory. It would have been a bit easier to keep the matrix in memory and continue on to the correlation calculations, but here is a natural breaking point which allows the user to do other things with the data if they wish. "k_mers_corr" is a fairly simple function, which takes the names of the file produced in the previous function, as well as another output file to store this new data. The new data will be a symmetric matrix where each transcript is compared to every other transcript in the file, and their similarity are saved. After getting things going by bringing in the data and creating a square matrix of the proper size, we start iterating through the rows of our data.

def k_mers_corr(c_mat, outfile):        
    c_mat = np.genfromtxt(open(c_mat,"r"), delimiter=",") #Open the count matrix
    corr_mat = np.zeros([c_mat.shape[0],c_mat.shape[0]]) #Initialize new matrix
    for i, row1 in enumerate(c_mat): #For each row in count matrix

For each row we go through all of the rows (including the row we are currently on) and compare their similarity via the Pearson correlation. Once we have that number, we store it in the proper cell in the matrix we initialized earlier. Once we have filled all of the cells, we save our data.

        for j, row2 in enumerate(c_mat): #For each row in count matrix
            dist = pearsonr(row1,row2)[0] #Get the corr coeff between rows
            corr_mat[j][i] = dist #Add coeff to matrix
    np.savetxt(open(outfile, "w"),corr_mat, delimiter=",") #Save the matrix

And we're finished! The functions I've walked through here are actually stripped down versions of what I have been using recently (don't want to give away ALL my secrets!), so it would appear that I've made my first real Bioinformatics post. I hope I made it clear but if any clarifications are needed, feel free to ask a question in the comments below. Cheers!

Sunday, January 11, 2015

Syntax Highlighting

I had to follow a few different explanations on how to employ syntax highlighting on Blogger before I found a set of instructions that worked. In case anyone else is interested in posting some code onto their site, I would highly recommend checking out this blog, which has a nice visual walk through and the necessary code available. It only took me a few minutes to set up. Below are a couple of brief examples (including the one on the post) of what the highlighting looks like.

<?php
$example = range(0, 9);
foreach ($example as $value)
{
 echo $value;
}

#Here are some examples of syntax highlighting
"""This is a doc string."""
def Example(name):
    for i in range(4):
        print "Hello, %s!" %name

Example("Dan")

Code