What Is Bioinformatics?: January 2016

Friday, January 22, 2016

Using R in a Jupyter/Ipython Notebook with Ubuntu

I may have mentioned it before, but I love the Jupyter Notebook. It's the only way I come close to keeping my code and analyses organized. I'm not a huge fan of R, mostly because I'm not great at it, but I have to use it sometimes. Today was my third attempt at integrating the R kernel into the Notebook, and it took several hours. I figured it would be worthwhile to chronicle what finally worked, in case I need it later.

Here's my setup:

Ubuntu 14.04 LTS
CPython 2.7.11
IPython 4.0.1
R version 3.2.3 (2015-12-10)

I have no idea how well these steps would work with different versions of any of these. One thing I do know: using R 3.2.x is critical. I was previously using R 3.0.2, which turned out to be the problem. 3.0.2 is still the default version of R on Ubuntu, so a couple of extra steps are necessary to get 3.2.3. So, if you need to upgrade R:

Open the 'Ubuntu Software Center'.
Edit -> Software Sources.
Click the 'Other Software' tab.
Click the 'Add' button.
Fill in 'deb http://cran.fhcrc.org/bin/linux/ubuntu trusty/'
Click the '+ Add Source' button.

This will allow apt-get to properly upgrade R. You can then go to your terminal:

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install r-base r-base-dev
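Since the R version was what tripped me up, it may be worth confirming the upgrade before going further. Here is a minimal Python sketch of one way to do that, assuming `R --version` prints a banner like the "R version 3.2.3 (2015-12-10)" line from my setup list. The function names and the (3, 2) minimum are my own choices for illustration, not part of any library.

```python
import re
import subprocess

# Assumed minimum (major, minor): 3.2.x worked for me, 3.0.2 did not.
MIN_R_VERSION = (3, 2)


def parse_r_version(banner):
    """Pull (major, minor, patch) out of a banner like 'R version 3.2.3 (2015-12-10)'."""
    match = re.search(r"R version (\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no R version found in: %r" % banner)
    return tuple(int(part) for part in match.groups())


def r_is_new_enough(banner, minimum=MIN_R_VERSION):
    """True when the major.minor version in the banner is at least `minimum`."""
    return parse_r_version(banner)[:2] >= minimum


if __name__ == "__main__":
    try:
        # Ask the installed R for its version banner.
        banner = subprocess.check_output(["R", "--version"]).decode()
    except (OSError, subprocess.CalledProcessError):
        print("Could not run R; is it installed and on the PATH?")
    else:
        print("R new enough: %s" % r_is_new_enough(banner))
```

If this reports that R is too old, the software source from the steps above probably did not get picked up by apt-get.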

At this point you should have a proper version of R. Next we can start working on kernel dependencies. Also from the terminal, run:

$ sudo apt-get install libzmq3-dev

We're almost done. From inside R (using either the terminal or an IDE like RStudio), run:

install.packages(c('rzmq','repr','IRkernel','IRdisplay'),
                 repos = c('http://irkernel.github.io/', getOption('repos')),
                 type = 'source')
IRkernel::installspec()

And there you go! Things should be up and running now, and you should have the option to create an R notebook.
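If the R option doesn't show up, one quick sanity check is to see whether Jupyter knows about the kernel at all. The sketch below parses the text printed by `jupyter kernelspec list`. It assumes the R kernel registers under the name `ir` (IRkernel's default, as far as I know) and that kernel paths contain no spaces. The sample listing and its paths are made up for illustration.

```python
import subprocess

# Illustrative output only; the real paths will differ on your machine.
SAMPLE_LISTING = """Available kernels:
  ir         /home/user/.local/share/jupyter/kernels/ir
  python2    /usr/local/share/jupyter/kernels/python2
"""


def kernel_names(listing):
    """Return kernel names from the text printed by `jupyter kernelspec list`."""
    names = []
    for line in listing.splitlines():
        parts = line.split()
        # Kernel rows look like '<name> <path>'; the header line has no path.
        if len(parts) == 2 and "/" in parts[1]:
            names.append(parts[0])
    return names


if __name__ == "__main__":
    try:
        listing = subprocess.check_output(["jupyter", "kernelspec", "list"]).decode()
    except (OSError, subprocess.CalledProcessError):
        print("Could not run 'jupyter kernelspec list'")
    else:
        print("R kernel registered: %s" % ("ir" in kernel_names(listing)))
```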

Monday, January 4, 2016

End of the Year Review (Part III)

This post contains the last three questions. The final two concern the phrase "your scientific approach", which I found a bit vague. I asked my PI for clarification, and he responded:

We all employ short-term and long-term strategies to achieve our goals. These include our working habits, the way we think about scientific questions, strategies for surveying the literature, the way we set up experiments and controls, the way we analyze data, and how we set ourselves up today for questions we want to address in the future.

All in all, I found this exercise to be quite useful, and would encourage anyone reading this to mentally answer the questions for themselves. Anyway, here are the last three questions.

What are your goals and milestones for 2016?

If 2015 was about gathering my research tools, then 2016 is about applying my new skills towards my project. That isn’t to say that I think I can quit getting better at programming. I hope that next year, my code has improved as much as it did in the last 12 months. All of the shortcomings I previously listed, for example, are on my TODO list for 2016.

This coming semester is going to be full. I want to learn a lot from my final class, Object Oriented Data Analysis, which should teach statistical techniques for high-dimensional data. Similarly, I’m hoping that the three classes I’m TAing will provide useful review opportunities and reinforce what I learned last year. Finally, I plan on not simply passing my oral preliminary exam, but using the preparation time to read and retain as much background literature as possible. I hope that I will stay focused enough to gain a full sense of everything that has been published in relation to my project. Hopefully, this clearer view of the field will then allow me to even more precisely define what research questions I am addressing.

By the beginning of summer, I plan on having a logical list of analyses to code and run. My goal for summer, then, is to minimize distractions and plow through the code. Again, I already have a well-defined project and a pretty good idea what the analyses will be, but orals should set me up to be exceptionally productive (in terms of research results) during the summer months. By the end of the summer, I would like to have enough high-quality data to publish in a medium-impact journal.

This coming Fall, I should have no obligations in terms of classes, TAing, prelims, etc. I would be disappointed if I didn’t have a paper submitted by the end of the year. Projecting this far into the future is a bit difficult, but I think Fall should be broken into two sections. Assuming that summer goes as I have outlined above, it seems appropriate to take some time to tie up loose ends and follow any especially interesting leads. In other words, I would like to take the first half of Fall to see if we have any data or insights from the summer that launch us into the high-impact range of journals. The second half of Fall should be spent writing; I don’t think it’s far-fetched to plan to have a full-fledged paper out the door before the Holidays.

What aspects of your scientific approach will you look to maintain in 2016?

Overall, I’m quite satisfied with my current scientific approach; I’ve gotten in a pretty good groove. Here are a few points I hope I keep up next year:

I’m optimistic and enthusiastic about my project. Too many graduate students don’t seem to enjoy their work for one reason or another, which hurts productivity.
I have a schedule that works for me. I may come in a bit late, but I stay fairly focused throughout the day and evening, and usually manage to pump out a few more hours of work at night before bed.
As I outlined in my ‘met goals’ section, I’m doing well at developing my analytical toolbox.
I’m generating logical and worthwhile research questions at a decent pace. A lot of them don’t pan out in the long run, of course, but I want to keep coming up with new questions.
While I wish that I would simply make fewer mistakes, I want to continue to be conscientious enough about my analyses that I catch my errors sooner rather than later. Just because I generate a set of numbers doesn’t mean I quit thinking about the code or the problem.

On a final note, I also want to make sure I stay generally healthy. Nothing makes it harder to do good science than being sick.

What aspects of your scientific approach can you improve in 2016?

I’ve always considered my propensity towards carving out quiet time one of my greatest intellectual strengths. Finding some time each day to reflect or just let your mind wander over the terrain of a problem is a critical component of creative problem solving. Creativity, in turn, is often necessary to overcome a difficult problem or see what everyone else is missing. Unfortunately, over the past few years, I have been finding it more and more difficult to make time for silence.

There’s a popular internet term, ‘shower thoughts’, which describes the phenomenon I just mentioned. What I find depressing is the probable etymology of the term. Showers are literally the single time in any given day that the average person spends without mental stimulation. Therefore, it is the only time when people come up with little treasures like ‘The object of golf is to play the least amount of golf.’ Everyone recognizes how busy modern society is, but we don’t fully appreciate how damaging constant visual/auditory input is to our deeper mental capacities.

So, this year, I want to make more of an effort to set aside time to just sit and think about my project. This means I’m not listening to music, checking notifications on my phone, or talking to people. It also means I’m not coding, reading new literature, or watching a Python tutorial. I’m only thinking about the project. I honestly think this is the single biggest improvement I, or virtually anyone else, could make in my/their scientific approach for 2016.

Below are a few smaller improvements I want to make:

I want to spend a bit more time each week reading literature, particularly in small doses.
Sometimes, in the interest of time, I skip computational controls because I assume I know what’s going on. It’s worth coding in more controls.
In the interest of health, I should spend more time exercising.

Overall, I want to be more deliberate and efficient with my time.

Saturday, January 2, 2016

End of Year Review (Part II)

It's rarely fun to admit your failures, but here's a list of things I didn't manage to pull off in 2015.

What are the goals and milestones you missed in 2015?

It’s not like I immediately understood everything I tried to learn in the last year. There are a number of tools that I still don’t feel comfortable with:

There are several functionalities that I’ve played with in Python’s core language that would likely be useful, but I haven’t had the time to properly understand them, or how to efficiently implement them in my code.

Generators can save on memory. Things like lists of 10mers could be generators.
Decorators would be useful for cleanly modifying functions when I want to have several options for how to run some piece of core logic.
Inheritance is going to become more important as I continue to switch over to a more OOP approach.
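To make the first two items concrete for myself, here is a minimal sketch of what I have in mind. It assumes sequences are plain Python strings, and the names `kmers`, `uppercase_input`, and `gc_fraction` are invented for this example. The generator yields 10mers one at a time instead of building the whole list in memory, and the decorator wraps a piece of core logic with an optional preprocessing step.

```python
import functools


def kmers(sequence, k=10):
    """Lazily yield every overlapping k-mer in `sequence`, one at a time."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def uppercase_input(func):
    """Decorator: upper-case the sequence argument before calling `func`."""
    @functools.wraps(func)
    def wrapper(sequence, *args, **kwargs):
        return func(sequence.upper(), *args, **kwargs)
    return wrapper


@uppercase_input
def gc_fraction(sequence):
    """Core logic: fraction of G and C bases in a non-empty sequence."""
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / float(len(sequence))


if __name__ == "__main__":
    # Only one 10mer exists in memory at any moment during this loop.
    for kmer in kmers("acgtacgtacgt"):
        print("%s %.1f" % (kmer, gc_fraction(kmer)))
```

The appeal of the decorator is that `gc_fraction` stays untouched; a different preprocessing option would just be a different decorator.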

Matplotlib has some serious disadvantages for plotting large datasets. I’ve made several attempts at investigating other plotting libraries, especially bokeh, but nothing has clicked with me. Part of the problem is that some of these libraries are still under very active development, so things are changing quickly. Another issue is that I only devote a small amount of time to trying to plot something (like an animated plot), and if it doesn’t work out quickly, I give up because I feel like I’m falling behind on analyses.
I was really hoping my Machine Learning course would give me the skills I needed to implement a neural network in Theano. Given the past few months, my guess is that I still have a serious time investment barrier before I would be able to properly incorporate a deep learning model into my research.
I picked up the basics of git during my third rotation. I use it to store backup copies of my code, but I haven’t used it enough to employ it properly. I realize, though, that if I used git more fully (creating branches, etc.), it could potentially save me from some goofy disaster while editing code or messing around with files.
I also got fairly comfortable with the Killdevil cluster during my third rotation. Because it is difficult to view the Notebook on Killdevil, I often avoid using the cluster. This means I haven’t kept up my command-line skills as much as I should have. I was originally hoping to be regularly using tools like ‘grep’, ‘awk’ and bash scripting by now, but I can avoid a lot of it with the Ubuntu GUI. Now, every time I do need the cluster for something, it slows me down considerably.
I’m in the same boat with regular expressions. Every once in a while I find a problem suitable for regex, google until I have the answer, and move on. I still don’t have enough real understanding of the tool for the solutions I find to stick in my memory, so I don’t really learn.
Statistics remains another weak area for me. The stats class I took at the beginning of grad school was a good introduction, but I need to find a way to stay fresh and continue to grow in this area. I still have a lot of trouble deciding when it’s appropriate to use what statistical test, for example.
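On the regular expressions item above, one thing that might help the syntax stick is keeping a small, commented pattern around rather than re-googling each time. Here is a minimal sketch using Python's `re` module. The motif and the function name `find_motif` are made up for illustration and have no biological significance.

```python
import re

# Illustrative pattern, piece by piece:
#   GC       the literal bases G then C
#   [ACGT]   exactly one base, any of the four
#   GC       the literal bases G then C again
MOTIF = re.compile(r"GC[ACGT]GC")


def find_motif(sequence, pattern=MOTIF):
    """Return (start, matched_text) for each non-overlapping match, left to right."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    for start, text in find_motif("AAGCTGCTTGCAGCA"):
        print("%d %s" % (start, text))
```

Note that `finditer` skips overlapping matches, which matters for repetitive sequence.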

There’s a theme here. Every couple of months, I run into a problem that can either be patched or worked around, or actually fixed with one of these tools. I decide to try out one of these tools, and spend a couple of hours reading and trying some stuff out. I learn a little bit, but not enough to fix whatever problem I’m having. I become frustrated because I wanted to have the problem fixed by the end of the day and I know I can apply my workaround. So, I just give up, apply my monkey patch, and move on. By the time I run into a similar problem a couple of months later, the little bit I learned during those few hours of research and practice has been completely forgotten, and the cycle starts over. This is an unfortunate waste of time, especially since all of the tools listed are extremely valuable resources. I’ve been aware of this issue for a while now, and was hoping that I would have some method for avoiding it by now.

If anyone has any suggestions on how to avoid and overcome the phenomenon I've described, I'd love to hear them!
