Wednesday, December 30, 2015

End of Year Review (Part I)

My PI has asked us to answer a few questions as a reflection on the past year and as mental preparation for the coming one. The exercise seems useful and my answers are fairly in depth, so I figured it would be worthwhile to record at least a few of them here. To keep things from getting overwhelming, I'm going to dedicate a separate post to each question.

What are the goals and milestones you met in 2015?

I have a vivid memory from Christmas break last year of asking a friend for advice on how to tackle a dataset that was larger than I knew how to handle. I had been doing scientific programming for about a month, and was worried that I was going to have to pick up C instead of continuing with Python. My code was slow and it seemed like weird things were happening with my memory usage. Not only did I have no clue what I was doing, but I also didn’t have any idea how I was supposed to go about learning what to do.

Part of the reason I remember the conversation is that the dataset I was struggling with is the same one that I use almost every day now. Today, thinking back to my coding style a year ago makes me want to shake my head for a variety of reasons. I took the time to find some code from last year; it was as bad as I remembered. Here’s a small list of problems (in no particular order) with the code:
  1. Most of the scripts have virtually no comments or documentation.
  2. Everything is hard-coded. Occasionally, there are chunks of data stored in the script. There are random strings referring to data files throughout the code. There’s no way I would reliably have found and changed all those strings if I wanted to use different data.
  3. There’s only minimal use of libraries like numpy and scipy; I didn’t feel very comfortable with them yet.
  4. Several scripts have no functions whatsoever, just 100 lines of straight code. All of these scripts were obviously capable of doing exactly one thing.
  5. Other scripts have one main function that takes a couple of basic parameters (like the names of files). Unfortunately, I didn’t yet know how to incorporate command-line options into a Python script. At the bottom of the script there’s a long list of strings corresponding to the files on which I had run the code. As I performed different runs of the script, I would comment out previous strings…
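
For contrast, here is a minimal sketch of how items 2 and 5 can be avoided with the standard library's argparse module. This is not code from my project: the function names, the --output option, and the line-counting "analysis" are placeholders I made up for illustration.

```python
import argparse


def build_parser():
    """Build the command-line interface for a hypothetical analysis script."""
    parser = argparse.ArgumentParser(
        description="Run one analysis on one or more data files."
    )
    # File names come from the command line instead of being hard-coded.
    parser.add_argument("data_files", nargs="+", help="paths to the input data files")
    parser.add_argument("--output", default="results.csv", help="where to write results")
    return parser


def analyze(path):
    """Placeholder analysis: count the lines in one file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def main(argv=None):
    """Parse the arguments, analyze each file, and write one row per file."""
    args = build_parser().parse_args(argv)
    counts = {path: analyze(path) for path in args.data_files}
    with open(args.output, "w") as out:
        for path, count in counts.items():
            out.write(f"{path},{count}\n")
    return counts


# In a real script, the last two lines would be:
#     if __name__ == "__main__":
#         main()
```

With that guard added, switching datasets means changing the arguments on the command line rather than commenting strings in and out at the bottom of the file.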

These coding problems had very real effects on my ability to do computational research. I constantly had trouble remembering which script I needed to bring up to run a particular analysis. Once I found the code, I often forgot to switch out one or more of the hard-coded strings, which meant the analysis ran incorrectly and I had to run it again. And at the end of the week, I often didn’t remember how I had produced the data I had stored.

My primary goal for last year was to become comfortable with the Python scientific ecosystem. I’ve devoted a lot of time and effort to finding and exploring new computational tools in order to do so. While I’ve only explored a tiny fraction of all the available resources, I believe I achieved my goal insofar as I can now reliably reproduce my research and I know where to look when I need answers to a new question. Below is a list of tools that have become indispensable, and how they have allowed me to reach specific computational milestones:
  1. Surprisingly, has been a great resource. The community is active and new packages, updates, and other resources are constantly posted and discussed. I now have a way to stay up to date with virtually everything Python has to offer as a data analysis tool.
  2. Numpy/Scipy/Pandas/Matplotlib/sklearn are now part of my everyday tools. I understand their APIs and documentation, and I am generally familiar with their capabilities and limitations. This means that I can import, manipulate, analyze, plot, and save large datasets much more quickly than I could last year. A great example is using Numpy and linear algebra to compute Pearson’s correlation ~500 times faster than I could before.
  3. Discovering the IPython/Jupyter Notebook has been the biggest game changer for me. Using the notebook allows me to essentially record everything I do in a month. A lot of code that I write has a one-and-done functionality. There’s no reason for me to store it as its own script. Now I don’t have to. I can make notes to myself in the same location as my code, save graphs directly below the code that made them, and section off my projects in a way that makes sense. Basically, I now have a method for reproducibility that has stopped my code directory structure from becoming an incomprehensible disaster. Organization is not my natural forte, so having the capability to use the Notebooks feels like a major accomplishment.
  4. Another large benefit of the Notebook is that the .py files I do write are readable and reusable. When I write some code I know I’m going to need again, I take a bit of time and rewrite it in a .py file as a class. I properly break the code up into modular functions and remember to write a docstring for each function. Inevitably, when I go to rerun my code, I’m going to want to try something slightly different than I did the first time. Now, all of my core logic is abstracted away in a .py file. All I have to do is instantiate the class I want inside of the Notebook and choose the functions I want to use this time. All of my slight variations from run to run are immediately saved in the notebook.
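
To make items 2 and 4 a bit more concrete, here is a rough sketch of the pattern: a small class that would live in a .py file, with the Pearson correlation written as linear algebra instead of nested loops. It is an illustration rather than my actual research code. The class and method names are invented, and I'm assuming the data is a 2-D array with one variable per row.

```python
import numpy as np


class CorrelationAnalysis:
    """Hypothetical helper that correlates every pair of variables in a dataset."""

    def __init__(self, data):
        """Store the data as a 2-D float array (rows: variables, columns: observations)."""
        self.data = np.asarray(data, dtype=float)

    def standardize(self):
        """Return the rows centered to zero mean and scaled to unit Euclidean norm.

        Assumes no row is constant; a constant row has zero norm and
        would cause a division by zero.
        """
        centered = self.data - self.data.mean(axis=1, keepdims=True)
        norms = np.linalg.norm(centered, axis=1, keepdims=True)
        return centered / norms

    def pearson_matrix(self):
        """Return the matrix of Pearson correlations between all pairs of rows."""
        z = self.standardize()
        # The dot product of two standardized rows is their Pearson
        # correlation, so one matrix product covers every pair at once.
        return z @ z.T
```

Inside a Notebook, the idea is to import the class, instantiate it with whatever array is loaded that day, and call only the methods needed for that run. The result should agree with np.corrcoef on the same array, which makes for an easy sanity check.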

While I’m still a long way from being a data analysis expert, I can honestly say that I no longer feel on the brink of disaster like I did this time last year. There is room for improvement in each of the four points listed, but I’m happy to say that I believe I’m on the right track.
