What Is Bioinformatics?

Sphinx Documentation with Numpydoc

2016-06-06T18:25:00.000-07:00

I'm finally back from my proposal and orals! Now that I'm done with that mess, here's a fun little tutorial that I've already found helpful for myself. Hopefully others find it useful as well:

Using R in a Jupyter/Ipython Notebook with Ubuntu

2016-01-22T15:54:00.002-08:00

I may have mentioned it before, but I love the Jupyter Notebook. It's the only way I come close to keeping my code and analyses organized. I'm not a huge fan of R, mostly because I'm not great at it, but I have to use it sometimes. Today was my third attempt at integrating the R kernel into the Notebook, and it took several hours. I figured it would be worthwhile to chronicle what finally worked, in case I need it later.

Here's my setup:

Ubuntu 14.04 LTS
CPython 2.7.11

IPython 4.0.1

R version 3.2.3 (2015-12-10)

I have no idea how well these steps would work with any variation in any of these versions. One thing I do know, using R 3.2.x is critical. I was previously using R 3.0.2, which turned out to be the difficulty. 3.0.2 continues to be the default download of R on Ubuntu, so a couple of extra steps are necessary to get 3.2.3 So, if you need to upgrade R:

Open the 'Ubuntu Software Center'.
Edit -> Software Sources.
Click the 'Other Software' tab.
Click the 'Add' button.
Fill in 'deb http://cran.fhcrc.org/bin/linux/ubuntu trusty/'
Click the '+ Add Source' button.

This will allow apt-get to properly upgrade R. You can then go to your terminal:

$sudo apt-get update
$sudo apt-get upgrade
$sudo apt-get install r-base r-base-dev

At this point you should have a proper version of R. Next we can start working on kernel dependencies. Also from the terminal, run:

$sudo apt-get install libzmq3-dev

We're almost done. From inside R (using either the terminal or an IDE like RStudio), run:

install.packages(c('rzmq','repr','IRkernel','IRdisplay'),
                 repos = c('http://irkernel.github.io/', getOption('repos')),
                 type = 'source')
IRkernel::installspec()

And there you go! Things should be up and running now, and you should have the option to create an R notebook:

End of the Year Review (Part III)

2016-01-04T21:37:00.000-08:00

This post contains the last three question. The final two concern the phrase "your scientific approach", which I found a bit vague. I asked my PI for clarification, and he responded:

We all employ short-term and long-term strategies to achieve our goals. These include our working habits, the way we think about scientific questions, strategies for surveying the literature, the way we set up experiments and controls, the way we analyze data, and how we set ourselves up today for questions we want to address in the future.

All in all, I found this exercise to be quite useful, and would encourage anyone reading this to mentally answer the questions for themselves. Anyway, here are the last three questions.

What are your goals and milestones for 2016?

If 2015 was about gathering my research tools, then 2016 is about applying my new skills towards my project. That isn’t to say that I think I can quit getting better at programming. I hope that next year, my code has improved as much as it did in the last 12 months. All of the shortcomings I previously listed, for example, are on my TODO list for 2016.

This coming semester is going to be full. I want to learn a lot from my final class, Object Oriented Data Analysis, which should teach statistical techniques for high-dimensional data. Similarly, I’m hoping that the three classes I’m TAing will provide useful review opportunities and reinforce what I learned last year. Finally, I plan on not simply passing my oral preliminary exam, but using the preparation time to read and retain as much background literature as possible. I hope that I will stay focused enough to gain a full sense of everything that has been published in relation to my project. Hopefully, this clearer view of the field will then allow me to even more precisely define what research questions I am addressing.

By beginning of summer, I plan on having a logical list of analyses to code and run. My goal for summer, then, is to minimize distractions and plow through the code. Again, I already have a well-defined project and a pretty good idea what the analyzes will be, but orals should set me up to be exceptionally productive (in terms of research results) during the summer months. By the end of the summer, I would like to have enough high-quality data to publish in a medium impact journal.

This coming Fall, I should have no obligations in terms of classes, TAing, prelims, ect. I would be disappointed if I didn’t have a paper submitted by the end of the year. Projecting this far into the future is a bit difficult, but I think Fall should be broken into two section. Assuming that summer goes as I have outlined above, it seems appropriate to take some time to tie up loose ends and follow any especially interesting leads. In other words, I would like to take the first half of Fall to see if we have any data or insights from the summer that launch us into the high-impact range of journals. The second half of Fall should be spent writing; I don’t think it’s far-fetched to plan to have a full-fledged paper out the door before the Holidays.

What aspects of your scientific approach will you look to maintain in 2016?

Overall, I’m quite satisfied with my current scientific approach; I’ve gotten in a pretty good groove. Here are a few points I hope I keep up next year:

I’m optimistic and enthusiastic about my project. Too many graduate students don’t seem to enjoy their work for one reason or another, which hurts productivity.
I have a schedule that works for me. I may come in a bit late, but I stay fairly focused throughout the day and evening, and usually manage to pump out a few more hours of work from at night before bed.
As I outlined in my ‘met goals’ section, I’m doing well at developing my analytical toolbox.
I’m generating logical and worthwhile research questions at a decent pace. A lot of them don’t pan out in the long run, of course, but I want to keep coming up with new questions.
While I wish that I would simply make less mistakes, I want to continue to be conscientious enough about my analyses that I catch my errors sooner rather than later. Just because I generate a set of numbers doesn’t mean I quit thinking about the code or the problem.

On a final note, I also want to make sure I stay generally healthy. Nothing makes it harder to do good science than being sick.

What aspects of your scientific approach can you improve in 2016?

I’ve always considered my propensity towards carving out quiet time one of my greatest intellectual strengths. Finding some time each day to reflect or just let your mind wander over the terrain of a problem is a critical component of creative problem solving. Creativity, in turn, is often necessary to overcome a difficult problem or see what everyone else is missing. Unfortunately, over the past few years, I have been finding it more and more difficult to make time for silence.

There’s a popular internet-term, ‘shower thoughts’, which describes the phenomenon I just mentioned. What I find depressing is the probable etymology of the term. Showers are literally the single time in any given day that the average person spends without mental stimulation. Therefore, it is the only time when people come up with little treasures like ‘The object of golf is to play the least amount of golf.’ Everyone recognizes how busy modern society is, but we don’t fully appreciate how damaging constant visual/auditory input is to our deeper mental capacities.

So, this year, I want to make more of an effort for making the time to just sit and think about my project. This means I’m not listening to music, checking notifications on my phone, or talking to people. It also means I’m not coding, reading new literature, or watching a Python tutorial. I’m only thinking about the project. I honestly think this is the single biggest improvement I, or virtually anyone else, could make in my/their scientific approach for 2016.

We all employ short-term and long-term strategies to achieve our goals. These include our working habits, the way we think about scientific questions, strategies for surveying the literature, the way we set up experiments and controls, the way we analyze data, and how we set ourselves up today for questions we want to address in the future.

Below are a few smaller improvements I want to make:

I want to spend a bit more time each week reading literature, particularly in small doses.
Sometimes, in the interest of time, I skip on computational controls because I assume I know what’s going on. It’s worth coding in more controls.
In the interest of health, I should spend more time exercising.

Overall, I want to be more deliberate and efficient with my time.

End of Year Review (Part II)

2016-01-02T10:18:00.000-08:00

It's rarely fun to admit your failures, but here's a list of things I didn't mange to pull off in 2015.

What are the goals and milestones you missed in 2015?

It’s not like I immediately understood everything I tried to learn in the last year. There are a number of tools that I still don’t feel comfortable with:

There are several functionalities that I’ve played with in Python’s core language that would likely be useful, but I haven’t had the time to properly understand them, or how to efficiently implement them in my code.

Generators can save on memory. Things like lists of 10mers could be generators.
Decorators would be useful for cleanly modifying functions when I want to have several options for how to run some piece of core logic.
Inheritance is going to become more important as I continue to switch over to a more OOP approach.

Matplotlib has some serious disadvantages for plotting large datasets. I’ve made several attempts at investigating other plotting libraries, especially bokeh, but nothing has clicked with me. Part of the problem is that some of these libraries are still under very active development, so things are changing quickly. Another issue is that I only devote a small amount of time to trying to plot something (like an animated plot), and if it doesn’t work out quickly, I give up because I feel like I’m falling behind on analyses.
I was really hoping my Machine Learning course would give me the skills I needed to implement a neural network in Theano. Given the past few months, my guess is that I still have a serious time investment barrier before I would be able to properly incorporate a deep learning model into my research.
I picked up the basics of git during my third rotation. I use it to store backup copies of my code, but I haven’t used it enough to employ it properly. I realize that if I used git enough though (creating branches, etc.) it could potentially save me from some goofy disaster while editing code or messing around with files.
Similarly, I got fairly comfortable with the Killdevil cluster during my third rotation as well. Because of the difficulties of visualization of the Notebook on Killdevil, I often avoid using Killdevil. This means I haven’t kept up my command lines skills as much as I should have. I was originally hoping to be regularly using tools like ‘grep’, ‘awk’ and bash scripting by now, but I can avoid a lot of it with the Ubuntu GUI. Now, every time I do need the cluster for something, it slows me down considerably.
I’m in the same boat with regular expressions. Every once in awhile I find a problem suitable for regex, google until I have the answer, and move on. I still don’t have enough real understanding of the tool to have the solutions I find stick around in my memory, so I don’t really learn.
Statistics remains another weak area for me. The stats class I took at the beginning of grad school was a good introduction, but I need to find a way to stay fresh and continue to grow in this area. I still have a lot of trouble deciding when it’s appropriate to use what statistical test, for example.

There’s a theme here. Every couple months, I run into a problem that can be patched and/or worked around, or actually fixed with one of these tools. I decided I’m going to try out one of these tools, and spend a couple hours reading and trying some stuff out. I learn a little bit, but not enough to fix whatever problem I’m having. I become frustrated because I wanted to have the problem fixed by the end of the day and I know I can apply my work around. So, I just give up, apply my monkey patch, and move on. By the time I run into a similar problem a couple months later, the little bit I learned during those few researching and practicing hours has been completely forgotten and the cycle starts over. This is an unfortunate waste of time, especially since all of the tools listed are extremely valuable resources. I’ve been aware of this issue for a while now, and was hoping that I would have some method for avoiding it by now.

If anyone has any suggestions on how to avoid and overcome the phenomenon I've described, I'd love to hear them!

End of Year Review (Part I)

2015-12-30T19:05:00.001-08:00

My PI has requested that we answer a few questions as a reflection of the past year and as a mental preparation for the coming year. I figured it would be worthwhile to record at least a few of my answers here. This seems like a useful exercise and my answers are decently in depth. So the post isn't overwhelming, I'm going to dedicate a different post for each question.

What are the goals and milestones you met in 2015?

I have a vivid memory from Christmas break last year of asking a friend for advice on how to tackle a dataset that was larger than I knew how to handle. I had been doing scientific programming for about a month, and was worried that I was going to have to pick up C instead of continuing with Python. My code was slow and it seemed like weird things were happening with my memory usage. Not only did I have no clue what I was doing, but I also didn’t have any idea how I was supposed to go about learning what to do.

Part of the reason I remember the conversation is because the dataset I was struggling with was the same one that I use almost everyday now. Today, thinking back to my coding style a year ago makes me want to shake my head for a variety of reasons. I took the time to find some code from last year; it was as bad as I remembered. Here’s a small list of problems (in no particular order) with the code:

Most of the scripts have virtually no comments or documentation.
Everything is hard-coded. Occasionally, there are chunks of data stored in the script. There are random strings referring to data files throughout the code. There’s no way I would reliably have found and changed all those strings if I wanted to use different data.
There’s only minimal use of libraries like numpy and scipy; I didn’t feel very comfortable with them yet.
Several scripts have no functions whatsoever, just 100 lines of straight code. All of these scripts were obviously capable of doing exactly one thing.
Other scripts have one main function that take a couple basic parameters (like the names of files). Unfortunately, I didn’t yet know how to incorporate command-line options into a python script. At the bottom of the script there’s a long list of strings corresponding to the files on which I had run the code. As I performed different runs of the script, I would comment out previous strings…

These coding problems had very real effects on my ability to do computational research. It was constantly difficult for me to remember which script I needed to bring up to run a particular analysis. Once I found the code, I often forgot to switch out one or more of the strings in my code, which meant it ran incorrectly and I had to rerun the code. And at the end of the week, I often didn’t remember how I managed to produce the data I had stored.

My primary goal for last year was to become comfortable with the Python scientific ecosystem. I’ve devoted a lot of time and effort to finding and exploring new computational tools in order to do so. While I’ve only explored a tiny fraction of all the available resources, I believe I achieved my goal insofar as I can now reliably reproduce my research and I know where to look when I need answers to a new question. Below is a list of tools that have become indispensable, and how they have allowed me to reach specific computational milestones:

Surprisingly, https://www.reddit.com/r/Python has been a great resource. The community is active and new packages, updates, and other resources are constantly posted and discussed. I now have a way to stay up to date with virtually everything Python has to offer as a data analysis tool.
Numpy/Scipy/Pandas/Matplotlib/sklearn are now part of my everyday tools. I understand their APIs, documentation, and am generally familiar with their capabilities and limitations. This means that I can much more quickly import, manipulate, analyze, plot, and save large datasets than I could last year. A great example is using Numpy and linear algebra for computing Pearson’s correlation ~500 times faster than I was previously.
Discovering the Ipython/Jupyter Notebook has been the biggest game changer for me. Using the notebook allows me to essentially record everything I do in a month. A lot of code that I write has a one-and-done functionality. There’s no reason for me to store it as its own script. Now I don’t have to. I can make notes to myself in the same location as my code, save graphs directly below the code that made them, and section off my projects in a way that makes sense. Basically, I now have a method for reproducibility that has stopped my code directory structure from becoming an incomprehensible disaster. Organization is not my natural forte, so having the capability to use the Notebooks feels like a major accomplishment.
Another large benefit of the Notebook is that the .py files I do write are readable and reusable. When I write some code I know I’m going to need again, I take a bit of time and rewrite it in a .py file as a class. I properly break the code up into modular functions and remember to write a docstring for each function. Inevitably, when I go to rerun my code, I’m going to want to try something slightly different than I did the first time. Now, all of my core logic is abstracted away in a .py file. All I have to do is instantiate the class I want inside of the Notebook and choose the functions I want to use this time. All of my slight variations from run to run are immediately saved in the notebook.

While I’m still a long way away from a data analysis expert, I can honestly say that I no longer feel on the brink of disaster like I did this time last year. Improvements can be made in each of the four points listed, but I’m happy to say that I believe I’m on the right track.

Teach Your Fourth Grader Python (Make a Game)

2015-12-28T12:27:00.003-08:00

Welcome to the second post on introducing your child to Python! You should have Anaconda (or another version of Python) installed on your computer before continuing. As I mentioned before, I wrote this code for and with a class of 4th graders. Did all of them understand all of the concepts? No, of course not; that wasn't really the point. All of the students thought that the game was fun. All of the students gained a little bit of insight into the technology they use everyday. That's the point.

Similarly, I'm not going to attempt to fully describe everything that's going on in the code here either. The hope is that I can given enough of a description that you'll be able to either guess or google how to change a line if you wanted to do something similar. What if you decided you wanted to do division problems instead? What if you wanted to add a third player? Exercises like these are fun ways to familiarize yourself with what's going on in a script.

Remember, this is for Python 3 and won't work with Python 2. Leave a comment if you want the 2.7 version for any reason. Here is the code in full:

import time
import random

#Introduction
print("Hello, Players!")
print("Welcome to Multiplication Battles!")
print("")

#Input information
player1 = input("What is your name, Player 1? ")
player1_points = 0
player2 = input("What is your name, Player2? ")
player2_points = 0 
length_per_round = int(input("How many seconds do you want each round to be? "))
highest_num = int(input("What's the largest number you want to multiply? "))

#Launch Player1's game
print("")
input("{}! Hit Enter to start your time.".format(player1))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
    
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
    answer = int(input("What is {}x{}? ".format(a,b)))
  
    #Check if right or wrong
    if answer == c:
        print("You're Correct!")
        player1_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
  
    #Must update the time at the end of each loop.
    current_time = time.time()

print("You scored {} points!".format(player1_points))

#Get ready for Player2
print("")
input("{}! Hit Enter to start your time.".format(player2))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
  
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
  
    #Check if right or wrong
    answer = int(input("What is {}x{}? ".format(a,b)))
    if answer == c:
        print("You're Correct!")
        player2_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
    
    #Must update the time at the end of each loop.
    current_time = time.time()
print("You scored {} points!".format(player2_points))

#Decide a winner
print("")
if player1_points > player2_points:
  print("{} wins! Good job!".format(player1))
elif player2_points > player1_points:
  print("{} wins! Good job!".format(player2))
else:
  print("It was a tie! You two should play again!")

Now we'll go through a section by section breakdown. Any line that begins with a # is a comment, and doesn't effect the way the code runs. Think of them as notes to yourself that the computer ignores.

import time
import random

These two lines give us access to other pieces of code we'll use later in the script. This page lists all of the 'libraries' that come standard with Python.

#Introduction
print("Hello, Players!")
print("Welcome to Multiplication Battles!")
print("")print ""

This is the beginning of the game. The keyword 'print' is how text shows up in the console when running python. Notice that everything being printed in inside of quotation marks. This just indicates that we are printing some normal text, technically called a 'string'. Other things, like integers, for example, can also be printed.

#Input information
player1 = input("What is your name, Player 1? ")
player1_points = 0
player2 = input("What is your name, Player 2? ")
player2_points = 0 
length_per_round = int(input("How many seconds do you want each round to be? "))
highest_num = int(input("What's the largest number you want to multiply? "))

Before launching into questions, the game collects a little bit of information about who's playing. The first thing we do is create 'player1' by asking the game players to input the name of the first player. Whatever the player types in will be stored inside of the variable 'player1'. We need to keep track of how many points the first person has, so we'll create another variable called player1_points. Since the game hasn't started, the first player currently has 0 points. The reason that the words on the left hand side of the equals sign are called variables is because their values can change over the course of the program. When the first player correctly answers a question, they'll gain a point and we'll overwrite the score stored in 'player1_points'. We'll see how this works later. Variables are a pretty complicated concept for 4th graders. The idea was one of the biggest difficult while teaching this course. If anyone has difficulty understanding what's going on here, I highly recommend checking out 'How to Automate the Boring Stuff'.

We then repeat the process for the second player, and collect their information. Lastly, we need to know how long we want each round to last and what the largest number is that we feel comfortable multiplying. Maybe we want to go easy and play a game for 30 seconds only multiplying up to 10. Or we could go for something more challenging and give ourselves a minute to multiply numbers up to 20. Whatever you want to do. Notice the 'int' surrounding the 'input' on the last two lines. 'int' stands for 'integer', and we use this because we want Python to treat our entries in these lines like numbers instead of 'strings'.

#Launch Player1's game
print("")
input("{}! Hit Enter to start your time.".format(player1))
current_time = time.time()
end_time = current_time + length_per_round

Line number 3 is a nice way to pause the game until the first player is ready. But what's going on with the '{}'? It's just a placeholder in the text. When it is printed to the console, the first player's name will be displayed in the sentence. So, if I were playing, for example, the prompt would read, 'Jessime! Hit Enter to start your time.' As soon as I hit enter, the next line of code would execute.
The next line is where one of our imports come in handy. We're just going to access the computer's clock to store the current time. We need to do this because we want the player to answer questions for 30 seconds, or however long was chosen. So we need to know what time the player started. We then calculate when the player's time is up (by adding up our current time and how long we want to play), and store it in the 'end_time' variable.

while current_time <= end_time:
    
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
    answer = int(input("What is {}x{}? ".format(a,b)))

This is where the first player begins getting questions. The first line is what's known as a while loop. Loops are another tough subject; use the link and other resources on Google to get more information if necessary. What's happening here is quite straightforward though. We're telling Python 'Hey, execute this next bit of code over and over until the first player runs out of time'. The 'next bit of code' means all of the lines which start 4 spaces over. The code that we're going to execute over and over is:

Generate a question
Ask the player the question
Give the player a point if they answer correctly
Tell the player the correct answer if they miss the question
Update the time and see if the 30 seconds is up or not

These five steps will be repeated. Lines 4-7 in the code above knock out the first two steps. We create 'a' and 'b' by using the 'random' library to generate a random integer in the range of 0 to our highest number (12, for example). The correct solution to the question will be 'a' times 'b', which we store as 'c'. Then we ask the question to the player and store their guess as 'answer'.

        #Check if right or wrong
        if answer == c:
            print("You're Correct!")
            player1_points += 1
        else:
            print("Sorry. {}x{}={}.".format(a,b,c))

Note: This should be indented 4 spaces, but the syntax highlighter won't allow is. Look at the whole code or the player2 example for proper indentation.
At this point, one of two things can happen. Either the player is right, or they're wrong. Line two says, if the answer the player gave is equivalent to the correct answer, execute the next bit of code. Again, the 'next bit of code' means all of the lines that are indented 4 spaces over. So, if the player get's the question correct, we're going to do two things. We print to the console letting the player know they answered correctly. We then increment 'player1_points' by one. That is, if the question is the first correctly answered, 'player1_points' goes from 0 to 1. If the first player has 4 points already, then 'player1_points' will now be equal to 5. If the player did not correctly answer the question, we print the correct equation.

    #Must update the time at the end of each loop.
    current_time = time.time()

This line is absolutely key to the while loop. It's the last line of the while loop, which means, since we're in a loop, we're about to jump back up to line 22 (of the whole code). Once we are back at 22, we're going to reevaluate if we have any time left. The only way that the 'current_time' variable will be accurate is if we update it by 'current_time = time.time()'. If we don't have this line, the while condition will always be true, you'll enter an infinite loop, and you'll have to shutdown python manually. Make sure to have this line.

Once the current time is greater than the end time, we'll exit out of the while loop, and the first player's turn will be over.

print("You scored {} points!".format(player1_points))

We can let the first player know how many points they scored. It may not look like it from the line count, but at this point, we're pretty much finished. A majority of the rest of the code is a duplicate of what we just did, but for the second player.

#Get ready for Player2
print("")
input("{}! Hit Enter to start your time.".format(player2))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
  
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
  
    #Check if right or wrong
    answer = int(input("What is {}x{}? ".format(a,b)))
    if answer == c:
        print("You're Correct!")
        player2_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
    
    #Must update the time at the end of each loop.
    current_time = time.time()
print("You scored {} points!".format(player2_points))

Literally the only change here is that I've replaced 'player1' with 'player2'. Now that each player has had their turn, the only thing left to do is figure out who the winner is.

#Decide a winner
print("")
if player1_points > player2_points:
  print("{} wins! Good job!".format(player1))
elif player2_points > player1_points:
  print("{} wins! Good job!".format(player2))
else:
  print("It was a tie! You two should play again!")

This block of code should look very similar to the part where we decided if the players correctly answered the question or not. The only difference is that there are now three things that can happen:

Player 1 can win.
Player 2 can win.
The players can tie.

Only one of these three statements will print, and we make the decision by comparing 'player1_points' to 'player2_points' and responding appropriately.

And that's it! Once the scripts executes this last line, it will automatically end and the game is over.

Teach Your Fourth Grader Python (Installing Anaconda)

2015-12-28T12:22:00.000-08:00

It's really never too early to start teaching your child to code. You may think, "What's the point of teaching my kid to code so long before they have any idea what they want to do with their life?" The topic is worth a post of its own, but here are a couple of articles by Beth Werrell and Dan Crow that address how useful "computational thinking" is in a world where everything relies on software. Another possible issue that might come to mind is, "How do I teach my child to code when I don't even know how?" Thankfully, there are already some really great resources out there to help you learn and teach at the same time. One of my favorite books is Invent Your Own Computer Games with Python which is free online. Another resource, which I haven't used at all but generally gets good reviews is Teach Your Kids to Code: A Parent-Friendly Guide to Python Programming.

I had the opportunity earlier this semester to team up with a friend of mine to each a few hours worth of Python to her 4th grade class. Let me point out that this isn't nearly enough time to learn a programming language. It's enough time to see what coding is and complete a small project to get the students interested in coding. In 4th grade kids are still learning/perfecting their multiplication tables, so we decided to make a multiplication 2-player game.

We'll go over two things. First, I'll show you how to easily install Python (it can be a little bit of a challenge on your own). Then, in the next post, I'll present the code that I wrote for class, and describe the basics of what it does. Again, this isn't going to provide enough information for you to "learn how to code". It's a fun first project to show you why learning to code is worthwhile. I would suggest using one of the resources I linked above to figure out all of the details about how the code works.

Installing Anaconda

Anaconda is a version of Python nicely bundled into a convenient package. Here are a list of steps to get up and running

1. Go to https://www.continuum.io/downloads to download the program.

2. You'll see options for Windows, OSX, and Linux. Choose the one that works for you.

3. You'll also have the option between Python 2.7 and 3.5. You should choose 3.5 unless you have a very specific reason for choosing 2.7.

4. Begin the download by clicking the 3.5 Graphical Installer I've highlighted in the red rectangle above.

5. Once the download is complete, open the installer.

6. Just follow the defaults and walk through the installer.

7. Agree to the Terms of Service.

8. Install for Just me.

9. Accept the default 'Destination Folder'

10. Click the 'Install' button.

11. Once the install as been completely finished, search for 'Spyder' in your computer's search. Look for an icon similar to the one below:

12. Note: Ignore that my icon has Python 2.7, it's what I use because of legacy issues.

13. Select the icon to launch Spyder. It will take a minute to launch.

14. Congratulations! You're ready to start coding.

In case you're wondering Spyder is what's known as an IDE. At its simplest, it is just a place to write and run your code. If yours doesn't look exactly the same as mine, don't worry. It's going to look different depending on whether you're using Windows, OSX or Linux.

Calculating a Pearson Correlation Matrix

2015-10-05T19:54:00.000-07:00

One of the great things about graduate school is you get to constantly realize how ignorant you are about how things work. If the domain name weren't already taken, I would think about moving my stuff over to seriously.dontusethiscode.com. It can get a bit depressing thinking about all the stuff there is to know. But if you think about it a bit more, it's really just awesome how exciting our world is. I'm glad I've got this opportunity to learn and explore.

Anyway, I found a question on reddit that reminded me of a post I did a while ago. One of the key points of 'Simple Sequence Similarity' was calculating a Pearson correlation matrix. While I realize that it isn't exactly the same as the problem on the reddit link, the nested iteration made me realize that I should probably show some improved code that I currently use to calculate the correlation matrix. By "improved", I mean ~500x faster.

Here's a link to the notebook.

As a fun bonus (which I won't dive into right now) there are at least four different examples of Ipython magic throughout the notebook. As always, I welcome any comments or suggestions!

Cheers!

Bot Battle

2015-08-28T21:27:00.000-07:00

For those of you who don't program, it may be hard to understand just how slow a process it usually is. Movies and other media don't help this misunderstanding. As Mark Zuckerberg put it, "If they were really making a movie, it would have been of me, sitting at a computer coding for two hours straight". Similarly, here's a visual representation of the difference.

Every once is a while though, it's actually useful to be able to throw together a quick script, or to generate a plot while you're having a discussion with your colleague. It's not something I'm particularly skilled at since I don't do it much. So is there some fun way to practice coding quickly?

Let's make a bot battle competition!

I wanted this to be a pretty straightforward weekend project, so I decided to stick with a simple game that I love a lot. Reversi, also known as Othello is a great game that even little kids can easily understand. The game itself hasn't been solved though, so it's far from trivial. It's the perfect game for this sort of project.

I'm not going to run through all of the code or anything, but the repo is available if you want to try it out yourself. A more detailed explanation of how to run the code is there, as well as the general rules of the game. If you don't feel like playing yourself, here's a quick clip of what a Series might look like:

If you have any suggestions or need help setting up the game, just let me know!

Cheers!

The Perceptron

2015-08-07T20:58:00.000-07:00

Today in lab meeting, during our bi-weekly "Idea Potluck", I dropped the ball while explaining how perceptrons work. We were running short on time, so I rushed and straight-up got flustered. It was unfortunate, since perceptrons are really cool, and Wikipedia has a great example (here).

To over compensate, I've written a tutorial to go along with the Wikipedia example.

http://nbviewer.ipython.org/gist/Jessime/62c2a8a0baa2c0d061b8

The example and the code are really straightforward, so check it out!

Cheers!

Bioinformatics Software

2015-08-03T20:45:00.000-07:00

This isn't a long post, but I read something interesting the other day and wanted to repost it. First, here's the link to the blog:

http://www.bioinformaticszen.com/

But since the post is so short, here it is in it's entirety:

"Every piece of software written for publication, which then can't be found after publication is grant funding thrown away. Every piece of software that only worked for the author during manuscript preparation is grant money thrown away. Every piece of software reinvented solely for the purpose of adding a new feature and publishing is grant money thrown away.

Grants are harder and harder to obtain, yet we fund the current attrition of moving bioinformatics software forward one reinvention at a time. Where else is it acceptable to reinvent a tool rather than try to improve upon an existing one? Count how many types of webservers are commonly used across the web, then count how many different genome assemblers have been published. If we were a company and we behaved this way, we would have gone out of business long ago. We have accepted a state whereby the short-term reward of publication trumps the longer-term goal of maintaining and improving existing open-source software. We now reap the rewards of this every time we can't find stable software to run our analyses."

I think anyone who's tried to learn bioinformatics instinctively knows this is true. Instead of whining about just how true I've found it to be in the past year, I'll offer two high-level solutions. By "high-level", I realize I ignore all of the depressing, nitty-gritty details.

1. Reproducing computational results must be encouraged and rewarded. Given the current research environment, the rate of independent software validation is unlikely to increase. We as a community must find methods to foster reproducibility; perhaps in ways as simple as offering "bounties" to teams or individuals who validate newly published software.

2. Guidelines for software publication should become as strict manuscript guidelines. Here's an outline of what's needed to publish to Science:
http://www.sciencemag.org/site/feature/contribinfo/prep/prep_init.xhtml
Where's the equivalent document for software? "Software" isn't even mentioned at all in the guidelines.

Excision Walkthrough

2015-07-13T19:06:00.002-07:00

I take things too seriously sometimes. As far as I'm aware, I've always been like that. When I was a kid, I knew, deep down, that walkthroughs were bad. Maybe they weren't quite morally wrong, but they were certainly pushing it. You can probably ask my brother how I felt about walkthroughs. He was often on the receiving end of my scorn when he used a website to figure out a particularly difficult puzzle. "That's not the way you're SUPPOSED to do it!" Children can be so vicious. Little did I know I would eventually be publishing my own walkthrough.

Hopefully I have done my fair share of maturing. Games, I realize almost two decades later than everyone else, are intended to be fun by definition. If utilizing a guide makes the game more entertaining for you, then by all means, use it. Enjoy the little things in life the way you like them, not the way that you're "supposed" to, or the way that others tell you you should. It's better to learn these things late than never, right?

I've had a tremendous amount of fun putting this game together. That being said, I am still taking things very seriously, and have also viewed this as a learning experience. I've never tried to market anything before. To put it lightly, I failed terribly. Most of the code that I wrote for this game is rather different than the style of code I write for work. Anyone with any web development experience would have a few things to say about my email script.

"Whatever. This is my first time, and my next one will be better." Tell yourself that and go try something new. Learn something difficult.

Needless to say, the following link contains spoilers. It also contains a pretty awesome story. While I fully encourage the use of this document, only use it if you know that's what you want to do.

[This link has been temporarily removed]

Until next time,
Cheers!

Excision Winner Announcement

2015-07-08T07:53:00.002-07:00

Congratulations to Ankur Sundara for being the first person to complete Excision! For his dedication and problem-solving abilities, he's won the first place price of $50! Ankur, a New Hampshire resident, is a rising senior in high school. Ankur had this to say about the game: "Excision was an awesome competition! It was a fun mix of knowledge of biology, computer security, and programming skills. My prior participation in hacking/programming competitions definitely helped me out a lot. Some of the challenges had me stumped for a while, but they all had pretty elegant solutions. I hope to see a bigger, better sequel to this competition in the future!" I want to thank everyone who participated in the game or even gave it a glance. Also, for those of you who still want to play, there's an awesome second place price: a sense of adventure and the feeling of a job well done! Until next time!

Base Camp

2015-02-21T17:45:00.001-08:00

Welcome to Excision.

Before we set off on our adventure, I think it's only fair to give you some provisions for your journey. They come mostly in the form of the basic philosophy of Excision. You're going to be playing the role of an Agent who runs into some unique circumstances which are going to require some equally unique talents. But I'm sure you're up for the job.

The basic objective of Excision is to complete a number of stages. There will also be an accompanying story, but I'll explain up front that the basic template for each stage is (with some variations):
1. Read the story. It provides important context for the stage, and it’s entertaining!
2. Work through the puzzle at that stage.
3. Uncover the password necessary to unlock the next stage/clue.
4. Find the location of the next stage/clue.
5. Repeat.

The game itself has a few main themes, which closely echo the themes of this blog. Computers and programming are really exciting subjects, not to mention increasingly ubiquitous in our everyday lives. Unfortunately, they can seem intimidating to learn. I'm hoping these stages can walk you through a few of the basics. Of course, since this is a bioinformatics blog, there's an emphasis on computing in biology, but the skills necessary to complete the stages are useful in any field. There is almost always more than one way to get to the next stage. Think creatively. Think logically. Talk to other people; sometimes people think about things in different ways.

Of course, what's a game without a prize? The first person to complete the final stage of the puzzle will be the winner of $50! The $50 can be sent directly to you (cash or check) or donated in your name to your favorite charity.

Additionally, here's a brief reminder that, while anyone can play Excision, to be eligible for the prize you have to like the new Facebook page for the blog: www.facebook.com/WhatIsBioinformatics

I'm trying to separate my personal Facebook account from the blog, and this is a good way of doing it. More importantly for the game, the Facebook page is the platform of the referral/reward system.
The Facebook page is how you can earn personalized hints in order to complete Excision more quickly. Games are always more entertaining when played with friends, so invite yours to like the page and help you solve the challenges. Below are the specifics of how to earn and spend hints:

• If someone mentions your name in a comment or a post on the Facebook page, you earn 1 hint.
• If you share the Facebook page on your Facebook account, you earn 2 hints.
• Special “Bonus Hints” times will give Agents unique ways to earn extra hints.
• You can spend hints by sending a private message to the Facebook page.
• In the message, describe how you earned your hint (Comment/Share/Bonus Hints).
• If you earned your hint via a comment, include the name of the person who mentioned you.
• Include whatever question you want to have answered.
• While responses may not be immediate, one of our team members will get back to you within 24 hours.
• Questions asked can be retracted, for whatever reason, at any point before a clue has been given by sending the message “Please ignore the previous question”.
• Retracted questions will not cost a hint.
• Hints can be stockpiled indefinitely and do not need to be spent any given time, even during Bonus Hints.

A copy of the official rules for the Excision contest may be found at http://tinyurl.com/oq3aw4r. Continuing to the first level from this blog post implies that you have read and accepted the rules and that you know what it takes to be eligible for the prize.

Note that this is Base Camp. A few of the stages are located here, though they're encrypted and password-protected. When you’re not sure where to go next, always try returning here.

A final tip: a good Agent, like a good scientist, always cHecks their source, not jusT the source of the inforMation he/she is currently looking at but also aLl the way back to the beginning.

Have fun and enjoy Excision!

Who was it?

What COUNTRY *space* STREET ?

What is he?

Where? xx.xxxxxx *space* xx.xxxxxx

The Protein Folding Answer

2015-02-15T21:19:00.000-08:00

I need to issue a disclaimer before anything else. The script below is most definitely NOT the answer to protein folding. In fact, I make no claims at all about the efficacy of this code beyond it's rudimentary ability to fold the beta-hairpin shown here. This code is just a homework assignment and learning experience for me. I believe the code is decently well commented for anyone who's working with PyRosetta themselves. MAYBE this can be used at a starting point.

I'm also not going to go through the code section by section as I did for the sequence similarity code, but here's a brief discussion of what the script is supposed to accomplish:

1. First the peptide has to be generated. It's what we are going to fold in this program.
2. Since we know we're making a beta-hairpin, we use knowledge based design to set the basic starting point angles of the backbone.
3. Then we go through a million or so cycles of what amounts to three steps:
i. Randomly shift the peptide by some amount.
ii. Calculate the energy score of the new position compared to the old one. (Since we know we're looking for hydrogen bonds, we add an extra bonus for moves that potentially form h-bonds)
iii. Decide whether to accept the new position of the peptide, or to reject it and keep the old position. This decision is based on the Metropolis Criteria.
4. As the folding simulation progresses we periodically output the energy score to see how things are going.
5. At the end of the run, we save the new peptide so we can view it later.

In a way, the steps above are deceptively simply. There are only a few processes and they're fairly straightforward. On the other hand, the theory underlying these steps are very powerful. More importantly, the amount of refinement that can be build on top of these steps is enormous.

At the beginning of the simulation, the peptide is just a straight chain:

By the end, the peptide has folded and formed hydrogen bond. The cartoon representation of the peptide shows that it does appear to be a beta-hairpin.

Below is dumped the code used to make this transformation. If you have questions about any of the specifics, I'd be happy to try to answer them. For those of you looking for the actual answer to the protein folding problem sorry for the disappointment.

Maybe next time. Cheers!!

"""Import necessary libraries and initialize rosetta"""
import random
import math
from rosetta import *
from rosetta.PyMolLink import *

init()  

"""Generate the beta-hairpin pose and score it"""
strand = "AAAAA" #alanine end residues
turn = "GG" #glycine turn residues
hairpin = strand+turn+strand #combine to make hairpin
pose = pose_from_sequence(hairpin,"fa_standard") #Create Rosetta pose
fa = get_fa_scorefxn() #Initialize a scoring function
fa(pose) #Score the hairpin
pose.dump_pdb("start1.pdb") #Generate an image of the starting structure
pose.set_psi(5,90) #Set starting angle
pose.set_phi(6,60) #Set starting angle
pose.set_psi(6,-120) #Set starting angle
pose.set_phi(7,-80) #Set starting angle
pose.set_psi(7,0) #Set starting angle
pose.set_phi(8,-120) #Set starting angle


"""Initialize variables"""
old_pose = Pose() #Initialize empty poses for later
low_pose = Pose() #Initialize empty poses for later
max_temp = 100 #Initial temperature for metropolis and annealing 
current_temp = max_temp #Current temperature will drop through cycles
max_cycles = 1000000 #How many moves you want to make 
low_energy = 10000 #Will store the value of lowest energy conformation
constraint_weight = 0.5 #Parameter for how much h-bond distances matter
ideal_dist = 0.29 #The proper distance for an h-bond between the alanines


"""Begin iterating through moves"""
for i in range(0,max_cycles): #Loop through the all the cycles


    """Update accept flag, and the old pose/energy/score"""
    old_pose.assign(pose) #Set the old_pose
    fa(old_pose) #Score the old pose
    old_energy = fa.score(old_pose) #Define variable for the energy of old pose
    old_score = old_energy #Alias old_energy, to use old_score for hbond constraining

    
    """Move given angle or angles by a chosen amount."""
    current_residue = random.randint(1,pose.total_residue()) #Choose the residue to move from a uniform distribution
    curr_phi = pose.phi(current_residue) #Set the current phi angle to be moved
    curr_psi = pose.psi(current_residue) #Set the current psi angle to be moved
    angle_shift = random.gauss(0,25*(float(i)/float(max_cycles))) #As the folding progresses, move the angle(s) by a smaller amount
    if(random.random()<(float(i)/float(max_cycles))): #As the folding progresses, make the probability of a shear move increase
        pose.set_phi(current_residue,curr_phi+angle_shift) #Set the new phi angle (shear move)
        pose.set_psi(current_residue,curr_psi-angle_shift) #Set the new psi angle (shear move)
    elif(random.random()<0 data-blogger-escaped--0.01="" data-blogger-escaped--="" data-blogger-escaped-.5="" data-blogger-escaped-0:="" data-blogger-escaped-0="" data-blogger-escaped-1000="" data-blogger-escaped-3="" data-blogger-escaped-a="" data-blogger-escaped-above="" data-blogger-escaped-accept:="" data-blogger-escaped-accept="" data-blogger-escaped-accepted="" data-blogger-escaped-accepting="" data-blogger-escaped-actions="" data-blogger-escaped-addition="" data-blogger-escaped-after="" data-blogger-escaped-alculate="" data-blogger-escaped-and="" data-blogger-escaped-angle="" data-blogger-escaped-angle_shift="" data-blogger-escaped-appropriate="" data-blogger-escaped-are="" data-blogger-escaped-as="" data-blogger-escaped-atom="" data-blogger-escaped-atoms="" data-blogger-escaped-ave="" data-blogger-escaped-bad="" data-blogger-escaped-between="" data-blogger-escaped-bonding="" data-blogger-escaped-change="" data-blogger-escaped-chosen="" data-blogger-escaped-code="" data-blogger-escaped-constraining="" data-blogger-escaped-constraint_weight="" data-blogger-escaped-core="" data-blogger-escaped-criteria="" data-blogger-escaped-cterm="pose.residue(12-2*k)" data-blogger-escaped-cterm_pos="Cterm.xyz(" data-blogger-escaped-curr_phi="" data-blogger-escaped-curr_psi="" data-blogger-escaped-current="" data-blogger-escaped-current_residue="" data-blogger-escaped-current_temp="max_temp*(math.exp(-1*max_cycles**-1))" data-blogger-escaped-delta_e="" data-blogger-escaped-dimentions="" data-blogger-escaped-dist="dist" data-blogger-escaped-distance="" data-blogger-escaped-each="" data-blogger-escaped-ecide="" data-blogger-escaped-edo="" data-blogger-escaped-efine="" data-blogger-escaped-else:="" data-blogger-escaped-end="" data-blogger-escaped-energy:="" data-blogger-escaped-energy="" data-blogger-escaped-epeat="" data-blogger-escaped-eset="" data-blogger-escaped-estore="" data-blogger-escaped-et="" data-blogger-escaped-f="" data-blogger-escaped-fa="" data-blogger-escaped-file="" data-blogger-escaped-final="" data-blogger-escaped-find="" data-blogger-escaped-first="" data-blogger-escaped-folding="" data-blogger-escaped-for="" data-blogger-escaped-hbond="" data-blogger-escaped-hbonding="" data-blogger-escaped-heck="" data-blogger-escaped-here="" data-blogger-escaped-i="" data-blogger-escaped-ideal_dist-old_dist="" data-blogger-escaped-if="" data-blogger-escaped-in="" data-blogger-escaped-inal="" data-blogger-escaped-ind="" data-blogger-escaped-is="" data-blogger-escaped-isn="" data-blogger-escaped-iterations="" data-blogger-escaped-j="" data-blogger-escaped-just="" data-blogger-escaped-k="" data-blogger-escaped-lias="" data-blogger-escaped-low_energy="" data-blogger-escaped-low_pose.assign="" data-blogger-escaped-low_pose.dump_pdb="" data-blogger-escaped-lower="" data-blogger-escaped-lowest="" data-blogger-escaped-math.fabs="" data-blogger-escaped-math.pow="" data-blogger-escaped-max_cycles="" data-blogger-escaped-metropolis="" data-blogger-escaped-move="" data-blogger-escaped-moves="" data-blogger-escaped-new="" data-blogger-escaped-new_dist="math.sqrt(dist)" data-blogger-escaped-new_energy="" data-blogger-escaped-new_score="new_score+math.fabs(ideal_dist-new_dist)*constraint_weight" data-blogger-escaped-not="" data-blogger-escaped-nterm="pose.residue(1+2*k)" data-blogger-escaped-nterm_pos="Nterm.xyz(" data-blogger-escaped-of="" data-blogger-escaped-old="" data-blogger-escaped-old_dist="math.sqrt(dist)" data-blogger-escaped-old_energy="" data-blogger-escaped-old_pose="" data-blogger-escaped-old_score="" data-blogger-escaped-one="" data-blogger-escaped-onte_carlo_2-blog2.pdb="" data-blogger-escaped-or="" data-blogger-escaped-other="" data-blogger-escaped-outcome.="" data-blogger-escaped-pair="" data-blogger-escaped-pdate="" data-blogger-escaped-pdb="" data-blogger-escaped-phi="" data-blogger-escaped-pose.="" data-blogger-escaped-pose.assign="" data-blogger-escaped-pose.set_phi="" data-blogger-escaped-pose.set_psi="" data-blogger-escaped-pose="" data-blogger-escaped-poses="" data-blogger-escaped-pre="" data-blogger-escaped-print="" data-blogger-escaped-probability="" data-blogger-escaped-process="" data-blogger-escaped-progress="" data-blogger-escaped-psi="" data-blogger-escaped-r="" data-blogger-escaped-range="" data-blogger-escaped-residue="" data-blogger-escaped-residues="" data-blogger-escaped-rint="" data-blogger-escaped-runs="" data-blogger-escaped-save="" data-blogger-escaped-score.="" data-blogger-escaped-score="" data-blogger-escaped-script="" data-blogger-escaped-second="" data-blogger-escaped-set="" data-blogger-escaped-shear="" data-blogger-escaped-so="" data-blogger-escaped-spacial="" data-blogger-escaped-statements="" data-blogger-escaped-str="" data-blogger-escaped-structure="" data-blogger-escaped-t="" data-blogger-escaped-take="" data-blogger-escaped-temperature="" data-blogger-escaped-term_pos="" data-blogger-escaped-than="" data-blogger-escaped-the="" data-blogger-escaped-there="" data-blogger-escaped-to="" data-blogger-escaped-until="" data-blogger-escaped-update="" data-blogger-escaped-use="" data-blogger-escaped-utoff="" data-blogger-escaped-value="" data-blogger-escaped-values="" data-blogger-escaped-variable="" data-blogger-escaped-very="" data-blogger-escaped-whether="" data-blogger-escaped-with="">

Encryptions

2015-02-06T18:07:00.001-08:00

Do you know the answer? Do you know the question?

The Protein Folding Problem

2015-02-05T20:37:00.001-08:00

First, if you've never had the opportunity to visit xkcd, I highly suggest you do it. It's a wonderful website. My personal favorite is the What If? feature. Read one of them; you'll like it. The comic relevant to this post is:

If you're unfamiliar with the protein folding problem, let me briefly set the stage. Imagine a protein made of 100 amino acids linked together in a chain. The protein has to figure out where to put each of these amino acids relative to each other, in order to work properly. (As an aside, proteins which fail to fold properly have implications to many horrible diseases.) For simplicity, let's also image that each of the amino acids can take one of 2 positions. This implies that the protein can take 2¹⁰⁰ positions. This may or may not seem like a lot of choices to you. Try putting 2¹⁰⁰ into WolframAlpha. It works out to be 1267650600228229401496703205376 different positions, and only one of them is right.

To think about this number a different way, consider having 100 computers that will each check a trillion of these different possibilities a second. That's pretty good, right? Actually, it's not even close. It's going to take you approximately ~400,000,000 years to be able to look at all possibilities. It's absolutely impossible to fold a protein this way. It's not how nature does it, and it isn't how we attempt to do it either. Instead, the general hypothesis is that as the protein begins to fold into a favorable/proper state, it lowers it's energy, which basically crosses certain possibilities off the list without even having to check. The usual analogy is a folding funnel:

So, while all spaces on the funnel are technically possible, the protein is almost always just going to fall down the funnel. When scientist try to fold proteins, they use this funnel analogy as a method for only checking some of the possible conformations. But they check intelligently, mostly looking at the positions that have a high chance of being the natural shape of the protein.

The specifics of how to implement such a search on a computer can get as technical as you like. But a teaching tool called PyRosetta has been made that allows students like me (and you?) to get a pretty good idea of whats going on and what tools we have available to us. In my next post I'll show you a very simple script written for class that allows for the folding of a short peptide.

Until next time, cheers.

Simple Sequence Similarity

2015-01-15T20:34:00.002-08:00

I've mostly refrained from posting any of the code that I've been writing in the past months because the code seems scraped together in a "just get it done" method that wouldn't be particularly interesting or useful to anyone but me. The programs are simple scripts only applicable to the single task at hand. But I've finally gained enough confidence to show off a few lines that I've been using recently.

How similar is the phrase "Get it finished" to "Finished get it"? Or in protein terminology, how homologous is GETITFINISHED to FINISHEDGETIT? Many similarity calculations assume sequence linearity, and therefore would not detect much likeness between these two phrases/sequences. The script below is a very simple way of detecting similarity in these sort of "mixed up" strings.

import numpy as np
from Bio import SeqIO
from itertools import product
from scipy.stats.stats import pearsonr

def k_mers(data, outfile,k):
    def occurrences(s,k1):
        dic = {} #Initialize dictionary
        c = 0 #Initialize counter
        while c+k1 <= len(s): #Iterate through the sequence until the end
            k_mer = s[c:c+k1] #Kmer defined as a slice of sequence s and size k
            if k_mer not in dic: #If the kmer hasn't been seen before
                dic.update({k_mer:1}) #Add the kmer as a key to the dictionary. 
            elif k_mer in dic: #If the kmer has been seen before
                dic[k_mer] += 1 #Increment the kmer's value by one.
            c += 1 #Move over one letter in the sequence
        return dic #Return the full dictionary

    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Open the fasta file
    c = 0 #Initialize a counter to determine size of storage array
    for i in transcripts: #For each transcript
        c+=1 #Add 1 to the counter
    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Reopen the fasta file
    c_mat = np.zeros([c,4**k]) #Initialize array to store the counts   
    for i, rna in enumerate(transcripts): #For each transcript
        counts = occurrences(str(rna.seq),k) #Get the kmer counts for sequence       
        counts_ar = np.zeros([4**k]) #Initialize array to store all kmers
        for j, sub in enumerate(product("AGTC", repeat=k)): #Generate each kmer
            k_mer = "".join(sub) #Turn kmer list into a string
            if k_mer in counts: #If the kmer is in the transcript
                counts_ar[j] = counts[k_mer] #Add the proper kmer value
        c_mat[i] = counts_ar #Add new row of counts to count matrix
    np.savetxt(open(outfile, "w"),c_mat, delimiter=",") #Save the matrix
   
def k_mers_corr(c_mat, outfile):        
    c_mat = np.genfromtxt(open(c_mat,"r"), delimiter=",") #Open the count matrix
    corr_mat = np.zeros([c_mat.shape[0],c_mat.shape[0]]) #Initialize new matrix
    for i, row1 in enumerate(c_mat): #For each row in count matrix
        for j, row2 in enumerate(c_mat): #For each row in count matrix
            dist = pearsonr(row1,row2)[0] #Get the corr coeff between rows
            corr_mat[j][i] = dist #Add coeff to matrix
    np.savetxt(open(outfile, "w"),corr_mat, delimiter=",") #Save the matrix
    
p = "A:/Library2/Data/" #My directory containing all my working files
i = p+"gencode_lncRNA_overlap_m.txt" #My transcript file
o1 = p+"blogger1.csv" #Output file for the count data
o2 = p+"blogger2.csv" #Output file for the correlation data
k_mers(i,o1,4) #Call to first function
k_mers_corr(o1,o2) #Call to second function

Even with the comments, this is a lot to digest at once, so let's go through it section by section.

The first section is pretty straightforward. It simply give Python access to the necessary tools that I'm going to need later on in the script.

import numpy as np
from Bio import SeqIO
from itertools import product
from scipy.stats.stats import pearsonr

The next section defines a couple of functions. "k_mers" takes the name of a data file (currently this must be a FASTA file, but it would be easy to generalize the function to be able to read other strings as well), the name of the file to which you would like to store your data, and the size of the substring/kmer you want to analyze. It will then produce a file telling you how many times each kmer appeared in each of the the sequences you want to compare. Notice that the "occurrences" function is defined exclusively inside of the "k_mers" function. It is the function that actually does the counting. If you know much of anything about Python, you probably want to point out that Python has a built in count method. The count method Python provides, however, does not count overlapping sub-strings. For example if you used the default count method on "AGTC" with a size of 2, it would return "AG" and "TC". "occurrences", on the other hand, will return "AG", "GT", and "TC"'.

def k_mers(data, outfile,k):
    def occurrences(s,k1):
        dic = {} #Initialize dictionary
        c = 0 #Initialize counter
        while c+k1 <= len(s): #Iterate through the sequence until the end
            k_mer = s[c:c+k1] #Kmer defined as a slice of sequence s and size k
            if k_mer not in dic: #If the kmer hasn't been seen before
                dic.update({k_mer:1}) #Add the kmer as a key to the dictionary. 
            elif k_mer in dic: #If the kmer has been seen before
                dic[k_mer] += 1 #Increment the kmer's value by one.
            c += 1 #Move over one letter in the sequence
        return dic #Return the full dictionary

Next, we need to know how many transcripts we are going to analyze. So, we open the file and read through it with a counter that adds one each time a new transcript is encountered. I'm not going to get into generators here, but Python needs to reopen a file to be able to read through it again, so I have to recall the same line.

    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Open the fasta file
    c = 0 #Initialize a counter to determine size of storage array
    for i in transcripts: #For each transcript
        c+=1 #Add 1 to the counter
    transcripts = SeqIO.parse(open(data, "r"),'fasta') #Reopen the fasta file

Now we can make an 2-by-2 array that's eventually going to be filled with the number of kmers that are present in each transcript and begin cycling through them. For each transcript, we call the "occurrences" function we defined above. We send "occurrences" the current transcript we are working on as well as the size of the kmers we are looking for. It will return with a dictionary of the kmers and their corresponding counts.

    c_mat = np.zeros([c,4**k]) #Initialize array to store the counts   
    for i, rna in enumerate(transcripts): #For each transcript
        counts = occurrences(str(rna.seq),k) #Get the kmer counts for sequence

The next few lines might make more sense if you've looked ahead to what sort of statistical method we are eventually going to use for comparison of the sequences. But consider the trivial case where we want to compare "AGTA" to "CAAA" using a kmer size of 1. The first transcript is going to produce "A"=2, "G"=1, and "T"=1 while the second will be "A"=4 and "C"=1. But what is needed for "an apples-to-apples" comparison is the first "A"=2, "G"=1, "T"=1, "C"=0 while the second will now be "A"=3, "G"=0, "T"=0, and "C"=1. These next lines are producing those necessary placeholders.

        for j, sub in enumerate(product("AGTC", repeat=k)): #Generate each kmer
            k_mer = "".join(sub) #Turn kmer list into a string
            if k_mer in counts: #If the kmer is in the transcript
                counts_ar[j] = counts[k_mer] #Add the proper kmer value

The last two lines of this function are pretty self explanatory. It would have been a bit easier to keep the matrix in memory and continue on to the correlation calculations, but here is a natural breaking point which allows the user to do other things with the data if they wish. "k_mers_corr" is a fairly simple function, which takes the names of the file produced in the previous function, as well as another output file to store this new data. The new data will be a symmetric matrix where each transcript is compared to every other transcript in the file, and their similarity are saved. After getting things going by bringing in the data and creating a square matrix of the proper size, we start iterating through the rows of our data.

def k_mers_corr(c_mat, outfile):        
    c_mat = np.genfromtxt(open(c_mat,"r"), delimiter=",") #Open the count matrix
    corr_mat = np.zeros([c_mat.shape[0],c_mat.shape[0]]) #Initialize new matrix
    for i, row1 in enumerate(c_mat): #For each row in count matrix

For each row we go through all of the rows (including the row we are currently on) and compare their similarity via the Pearson correlation. Once we have that number, we store it in the proper cell in the matrix we initialized earlier. Once we have filled all of the cells, we save our data.

        for j, row2 in enumerate(c_mat): #For each row in count matrix
            dist = pearsonr(row1,row2)[0] #Get the corr coeff between rows
            corr_mat[j][i] = dist #Add coeff to matrix
    np.savetxt(open(outfile, "w"),corr_mat, delimiter=",") #Save the matrix

And we're finished! The functions I've walked through here are actually stripped down versions of what I have been using recently (don't want to give away ALL my secrets!), so it would appear that I've made my first real Bioinformatics post. I hope I made it clear but if any clarifications are needed, feel free to ask a question in the comments below. Cheers!

Syntax Highlighting

2015-01-11T13:26:00.002-08:00

I had to follow a few different explanations on how to employ syntax highlighting on Blogger before I found a set of instructions that worked. In case anyone else is interested in posting some code onto their site, I would highly recommend checking out this blog, which has a nice visual walk through and the necessary code available. It only took me a few minutes to set up. Below are a couple of brief examples (including the one on the post) of what the highlighting looks like.

<?php
$example = range(0, 9);
foreach ($example as $value)
{
 echo $value;
}

#Here are some examples of syntax highlighting
"""This is a doc string."""
def Example(name):
    for i in range(4):
        print "Hello, %s!" %name

Example("Dan")

Be Right Back

2014-12-15T21:30:00.000-08:00

I'm usually not a huge fan of television, but recently it has been convenient to watch an episode of something or other with my wife after dinner. I reason that it gives me a bit of time to digest and unwind before I start in on homework or whatever project I'm working on. Usually Laura chooses what we watch, since it doesn't matter to me. But after reading this review of Black Mirror, I suggested we watch it.

Black Mirror is a British series that actually began airing in 2011. But as uncultured as I am, I hadn't heard about it until it began streaming on Netflix. Each episode is completely independent of the others, though they're very similar in theme and tone. Each episode covers some field of technology (ranging from currently available to fairly futuristic) crashing with the realistic intricacies of humanity. Most episodes focus on computer-centric scenarios, but Be Right Back covers a rather troublesome possibility of computational biology. I'm using the term broadly in this context, but it's fitting.

I don't want to give away any details, so I'll end by echoing The Verge article: these shows are creepy, and definitely intended for a mature audiance. But they're well worth the watch and deserve the award they won for Best TV Mini-Series.

Some Ethical Issues of Genomic Data

2014-12-04T22:23:00.001-08:00

Yesterday I spent several hours attending an ethics class as well as a ethics/policy seminary. Both were great, and while I haven't had a large amount of time to process the information myself, I thought it might be useful to reflect on it here and share what I learned. The sessions were on some of the issues and implications of high throughput sequencing and genomic data. While the lecture was interesting, it's the class discussion I've been thinking about most .

In class we talked about the trade off between privacy and information when dealing with personal genomic data. A few different subjects were covered. The difference in opt-in verses opt-out consent forms and what it means to given "informed" consent was brought up. People in the class thought that it's difficult for a layman to give truly informed consent, simply because the subject matter and background material are too deep to allow an understanding of the implications of genetic data in a few paragraphs. It's analogous to the Terms and Conditions we all blindly accept for using our favorite software and websites. People were also attempting to find the optimum phrasing to ensure that maximum utility of the genomic information could be realized without the phrasing of the consent form becoming vague and open enough that it worried potential subjects. Several people were concerned about data later being abused by researchers as it was in Arizona.

Also discussed was de-identifying verses anonymizing data. De-identified data has had the identifications are removed and cryptographically secured, but can be later re-identified by the holder of the encryption keys if necessary. Anonymous data, theoretically, has had the connection between the genome and the person has been completely removed. Most of the time was spent answering questions like, "Is it ethical to re-identify data without explicitly informing the subjects?" or "Who should be given the encryption keys?"

Overall, I felt like the discussion contained a diversity of opinions and was rather thought provoking. There was one topic, however, that seemed overlooked to me...

Personally, I'll admit that the issue of privacy has never been a huge concern of mine, but as we have seen recently, ANY digital data is vulnerable to intrusion. It doesn't matter if it's credit card, personal photo, classified, or genomic data. If it's out there, there is the possibility of it being leaked and/or abused. It doesn't even have to be illegal. What are the laws surrounding what the government is able to do with genomic data, and to what extent do organizations like the NSA feel they have to obey those laws? Is the government allowed to re-identify genomic data as evidence against a potential criminal? Say that DNA was found at a crime scene, and it doesn't match to any public database. The police have a suspect in mind, but not enough evidence for an arrest (after the arrest, they can require DNA swabs). Can they use re-identified data as cause for arrest? Other potential scenarios in which this question matters are distinctly dystopian. And while they may seem sci-fi, programs like PRISM make these topics worth discussing.

At the end of class my professor asked us if we would/will have our genome sequenced. We unanimously said yes. Because the benefits outweigh the risks, right? But then again, the poll was obviously both biased and homogeneous.

Would you read your blueprints, even if it meant that others potentially could as well?

My First Rotation

2014-11-24T16:22:00.000-08:00

I'm giving my first poster talk of my graduate school career tomorrow, along with approximately half of my cohort (the other half had their posters today). We are supposed to present what we accomplished over the past 10 weeks, during our first rotation. Though I understand that the real purpose of this poster session is to practice our communication skills and have the to opportunity to see what our peers have been doing, it's still mildly frustrating to feel like you don't have enough material to work with. Let's be honest, it's basically impossible to do enough novel research, in a field you're new to, in 10 weeks, to adequately fill an entire poster.

I wanted to upload the entire poster here, but there are probably some issues with that. On going research and all... But, I've gotten a couple of compliments on my poster design, so I'm thinking that I may use it as a template in the future. I've removed most of the text and details from the poster, so you can give me feedback on the visual aspects. A couple of things to take into consideration: there's a difference between an empty and filled poster, and some of the specifics of the current layout are a bit "bloated" to make up for the fact that there wasn't a ton of content (a bit more could be squeezed in, I'm sure).

Also, check out that QR code. It can take attendees straight to some of my microscopy videos, to help them visualize what I'm talking about. It wasn't my idea, but it's still pretty exciting. Here's the link it takes you to :

https://www.youtube.com/user/jessime100

You can watch one of the videos here too, if you prefer.

But back to the actual poster, I'd gladly take any comments or suggestion about layout, style, design, etc.

What is Genomics?

2014-11-12T15:43:00.003-08:00

I just began my second of three rotations this week, so it's been a busy transition time. But I thought I would share this short genomics video. It's really basic, which may or may not be appealing to you.

This video came from the "Education" tab from the website of my new lab. Feel free to check it out (or the rest of the site, for that matter) to get a better idea of what I'm working on at the moment.

http://www.med.unc.edu/pharm/calabreselab/education

The Desktop

2014-11-06T20:57:00.002-08:00

With my Birthday, Black Friday, and Christmas all right around the corner, I haven't been able to keep myself from thinking about the possibility of purchasing my first desktop upgrade. It's not that my computer needs work per se, I just built it in July, but I did stay under budget with the idea of upgrading in mind.

The relevancy of my desktop is complicated by two matters.

Generally, in the computational biology world, desktops can be a bit of an awkward middleman. For most everyday computing (reading papers, browsing the web, note taking, PowerPoint creation, light scripting, etc.), I am going to be on my laptop. I have to be in lab and class, so portability is necessary. On the other hand, if any real analysis is going to be needed, the lab should have workstations built for their pipeline or access to a cluster. I suppose that between those two things, it's easy to think that a home desktop isn't worth it.

Specifically, because I'm still rotating, I don't know what my long term computing needs are going to be. The bottlenecks are obviously going to be dependent on the nature of the analysis. Or the data may absolutely require that it be run on a cluster, so as long as I have a functioning terminal, everything is fine.

In this past rotation, however, having a desktop has been fantastic. I've been generating about 6 GB of image data per experiment-- and already filled 142 GB of my storage drive. Analyzing videos in ImageJ on a laptop with 4 GB of RAM isn't pleasant, and the lab didn't have an open workstation for me at the moment. One particular analysis quickly maxes out the 16 GB I currently have on my desktop. Homework also gets finished a lot faster with the extra screen space. I like the ability to have RStudio, Chrome, and the problem set all open side by side.

I don't have a particular point I'm trying to make here, I'm just considering the pros and cons of sinking money into what I primarily use as a work machine. Of course, I have made a bit of a hobby out of computer building and component researching, so there's a large amount of enjoyment simply from the process. And I would be lying to myself if I didn't admit a bit of pride in having a (fairly) fancy desktop I built sitting in the living room.

Anyway, now that the background and situation have been established, I'll leave brief updates as I continue to add and refresh components in the coming months and years. Having a complete history might be quite interesting by graduation if Moore's law keeps up. Here's a link to my current build:

http://pcpartpicker.com/user/jessime/saved/kg9nTW

I'd love to see what everyone else is working with (or dreaming about), so share your build below. Or, if you have suggestions for my first upgrade, post that.

What Did You Accomplish This Week?

2014-10-28T19:19:00.000-07:00

A few weeks ago I was jokingly asked by a friend (who has never set foot in a lab) if I had made any great discoveries that week. I had just finished confirming a genetic transformation, so I laughed and replied, "Well, I did make a new type of yeast." You may or may not consider that a bit of a stretch, but I thought it was a clever answer to an unexpected question. Apparently my exaggeration was enough to impress my friend, since I found out later that he went on to tell others about my "accomplishment." This brief interaction made me think about a couple things...

My first thoughts were about how annoying difficult it can seem to have enough things go right in a given week to really feel accomplished. You know those terrible Fridays when you look back on your week and think, "Where did it go, and why haven't I done anything?" Sure you do, we've all had them. But the question I was asking myself at that moment was "Is it unreasonable to consistently expect significant progress on a thesis project each week"? The answer obviously depends on the connotation of 'significant'. You probably won't add a line to your CV every week, and it's easy to default to a, "Well, science is hard and things go wrong," sort of attitude. The best answer I've found is a middle-of-the-road viewpoint is appropriate; I think it would be beneficial to note at least one achievement at the end of the week. This deed isn't to brag about, but to be self motivational.

I realized later that my friend was actually correct to be impressed with my statement. Not because I had done anything noteworthy, per se, but because the technique and the science behind it IS impressive. After a certain amount of time in the lab it becomes easy to lose sight of how awesome what we get to do is. You get into all of the little gritty details of thinking about the struggle of perfecting your sterile technique and ensuring that your epitubes are all properly labeled in your neatest tiny handwriting. You don't stop to consider that what you're doing. Taking traits from one organism and bestowing them on another could only be called magic a few generations ago. And there are people alive today who were born before this "mundane" task was ever done:

http://www.ncbi.nlm.nih.gov/pubmed/8150273.

For a slightly more humorous perspective, watch this Jimmy Kimmel clip:
https://www.youtube.com/watch?v=EzEr23XJwFY.

The point is, even the little things scientists do on a daily basis are awesome. My advice for this week, to myself as much as anyone, is to appreciate the opportunity you have to work in the field you do, and be sure to make the most out it.

Leave a comment below showcasing the most impressive thing you have accomplished in a week. Exaggerate at your own discretion.

Cheers