Code

Wednesday, December 30, 2015

End of Year Review (Part I)

My PI has requested that we answer a few questions as a reflection on the past year and as mental preparation for the coming year. I figured it would be worthwhile to record at least a few of my answers here. This seems like a useful exercise and my answers are decently in depth. So that no single post is overwhelming, I'm going to dedicate a separate post to each question.

What are the goals and milestones you met in 2015?

I have a vivid memory from Christmas break last year of asking a friend for advice on how to tackle a dataset that was larger than I knew how to handle. I had been doing scientific programming for about a month, and was worried that I was going to have to pick up C instead of continuing with Python. My code was slow and it seemed like weird things were happening with my memory usage. Not only did I have no clue what I was doing, but I also didn’t have any idea how I was supposed to go about learning what to do.

Part of the reason I remember the conversation is that the dataset I was struggling with is the same one I use almost every day now. Today, thinking back to my coding style a year ago makes me want to shake my head for a variety of reasons. I took the time to find some code from last year; it was as bad as I remembered. Here’s a small list of problems (in no particular order) with the code:
  1. Most of the scripts have virtually no comments or documentation.
  2. Everything is hard-coded. Occasionally, there are chunks of data stored in the script. There are random strings referring to data files throughout the code. There’s no way I would reliably have found and changed all those strings if I wanted to use different data.
  3. There’s only minimal use of libraries like numpy and scipy; I didn’t feel very comfortable with them yet.
  4. Several scripts have no functions whatsoever, just 100 lines of straight code. All of these scripts were obviously capable of doing exactly one thing.
  5. Other scripts have one main function that takes a couple of basic parameters (like the names of files). Unfortunately, I didn’t yet know how to incorporate command-line options into a Python script. At the bottom of the script there’s a long list of strings corresponding to the files on which I had run the code. As I performed different runs of the script, I would comment out previous strings…
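For the hard-coded file names in items 2 and 5, the standard-library `argparse` module is one way out. The following is only a minimal sketch, not code from the original scripts: the function name `parse_args`, the `--output` option, and the file names are all hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Read data file names and an output path from the command line.

    All option names here are made up for illustration.
    """
    parser = argparse.ArgumentParser(
        description="Run one analysis on the given data files.")
    # One or more positional arguments replace the strings scattered in the code.
    parser.add_argument("data_files", nargs="+", help="input data files")
    parser.add_argument("--output", default="results.csv",
                        help="where to write the results")
    # Passing argv=None makes argparse read the real command line.
    return parser.parse_args(argv)


# Simulate a command line so the sketch runs anywhere.
args = parse_args(["run1.csv", "run2.csv"])
for path in args.data_files:
    print("Would analyze {} and write to {}".format(path, args.output))
```

With something like this in place, switching datasets means changing what you type at the prompt rather than commenting strings in and out.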

These coding problems had very real effects on my ability to do computational research. It was constantly difficult for me to remember which script I needed to bring up to run a particular analysis. Once I found the code, I often forgot to switch out one or more of the strings in my code, which meant it ran incorrectly and I had to rerun the code. And at the end of the week, I often didn’t remember how I managed to produce the data I had stored.

My primary goal for last year was to become comfortable with the Python scientific ecosystem. I’ve devoted a lot of time and effort to finding and exploring new computational tools in order to do so. While I’ve only explored a tiny fraction of all the available resources, I believe I achieved my goal insofar as I can now reliably reproduce my research and I know where to look when I need answers to a new question. Below is a list of tools that have become indispensable, and how they have allowed me to reach specific computational milestones:
  1. Surprisingly, https://www.reddit.com/r/Python has been a great resource. The community is active and new packages, updates, and other resources are constantly posted and discussed. I now have a way to stay up to date with virtually everything Python has to offer as a data analysis tool.
  2. Numpy/Scipy/Pandas/Matplotlib/sklearn are now part of my everyday tools. I understand their APIs and documentation, and am generally familiar with their capabilities and limitations. This means that I can import, manipulate, analyze, plot, and save large datasets much more quickly than I could last year. A great example is using Numpy and linear algebra to compute Pearson’s correlation ~500 times faster than I could previously.
  3. Discovering the IPython/Jupyter Notebook has been the biggest game changer for me. Using the notebook allows me to essentially record everything I do in a month. A lot of code that I write has a one-and-done functionality. There’s no reason for me to store it as its own script. Now I don’t have to. I can make notes to myself in the same location as my code, save graphs directly below the code that made them, and section off my projects in a way that makes sense. Basically, I now have a method for reproducibility that has stopped my code directory structure from becoming an incomprehensible disaster. Organization is not my natural forte, so having the capability to use the Notebooks feels like a major accomplishment.
  4. Another large benefit of the Notebook is that the .py files I do write are readable and reusable. When I write some code I know I’m going to need again, I take a bit of time and rewrite it in a .py file as a class. I properly break the code up into modular functions and remember to write a docstring for each function. Inevitably, when I go to rerun my code, I’m going to want to try something slightly different than I did the first time. Now, all of my core logic is abstracted away in a .py file. All I have to do is instantiate the class I want inside of the Notebook and choose the functions I want to use this time. All of my slight variations from run to run are immediately saved in the notebook.
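The Pearson speedup in item 2 comes from replacing a Python loop over pairs with a single matrix product. Below is a minimal sketch of that idea, not the original analysis code: `pearson_matrix` and the random sample data are hypothetical, and it assumes each row of the array is one variable.

```python
import numpy as np


def pearson_matrix(data):
    """Pearson correlation between every pair of rows of `data`."""
    data = np.asarray(data, dtype=float)
    # Subtract each row's mean so the rows are centered on zero.
    centered = data - data.mean(axis=1, keepdims=True)
    # Scale each row to unit length (a constant row would divide by zero).
    norms = np.sqrt((centered ** 2).sum(axis=1, keepdims=True))
    unit = centered / norms
    # The dot product of two unit-length centered rows is their correlation.
    return unit @ unit.T


rng = np.random.default_rng(0)
sample = rng.normal(size=(5, 100))  # 5 made-up variables, 100 observations
corr = pearson_matrix(sample)
print(corr.round(2))
```

NumPy's own `np.corrcoef` gives the same matrix; writing it out by hand just makes the linear algebra visible.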

While I’m still a long way from being a data analysis expert, I can honestly say that I no longer feel on the brink of disaster like I did this time last year. Improvements can be made in each of the four points listed, but I’m happy to say that I believe I’m on the right track.

Monday, December 28, 2015

Teach Your Fourth Grader Python (Make a Game)

Welcome to the second post on introducing your child to Python! You should have Anaconda (or another version of Python) installed on your computer before continuing. As I mentioned before, I wrote this code for and with a class of 4th graders. Did all of them understand all of the concepts? No, of course not; that wasn't really the point. All of the students thought that the game was fun. All of the students gained a little bit of insight into the technology they use every day. That's the point.

Similarly, I'm not going to attempt to fully describe everything that's going on in the code here either. The hope is that I can give enough of a description that you'll be able to either guess or google how to change a line if you wanted to do something similar. What if you decided you wanted to do division problems instead? What if you wanted to add a third player? Exercises like these are fun ways to familiarize yourself with what's going on in a script.

Remember, this is for Python 3 and won't work with Python 2. Leave a comment if you want the 2.7 version for any reason. Here is the code in full:
import time
import random

#Introduction
print("Hello, Players!")
print("Welcome to Multiplication Battles!")
print("")

#Input information
player1 = input("What is your name, Player 1? ")
player1_points = 0
player2 = input("What is your name, Player 2? ")
player2_points = 0
length_per_round = int(input("How many seconds do you want each round to be? "))
highest_num = int(input("What's the largest number you want to multiply? "))

#Launch Player1's game
print("")
input("{}! Hit Enter to start your time.".format(player1))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
    
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
    answer = int(input("What is {}x{}? ".format(a,b)))
  
    #Check if right or wrong
    if answer == c:
        print("You're Correct!")
        player1_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
  
    #Must update the time at the end of each loop.
    current_time = time.time()

print("You scored {} points!".format(player1_points))

#Get ready for Player2
print("")
input("{}! Hit Enter to start your time.".format(player2))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
  
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
  
    #Check if right or wrong
    answer = int(input("What is {}x{}? ".format(a,b)))
    if answer == c:
        print("You're Correct!")
        player2_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
    
    #Must update the time at the end of each loop.
    current_time = time.time()
print("You scored {} points!".format(player2_points))

#Decide a winner
print("")
if player1_points > player2_points:
    print("{} wins! Good job!".format(player1))
elif player2_points > player1_points:
    print("{} wins! Good job!".format(player2))
else:
    print("It was a tie! You two should play again!")


Now we'll go through a section-by-section breakdown. Any line that begins with a # is a comment, and doesn't affect the way the code runs. Think of them as notes to yourself that the computer ignores.
import time
import random
These two lines give us access to other pieces of code we'll use later in the script. This page lists all of the 'libraries' that come standard with Python.
#Introduction
print("Hello, Players!")
print("Welcome to Multiplication Battles!")
print("")
This is the beginning of the game. The 'print' function is how text shows up in the console when running Python. Notice that everything being printed is inside of quotation marks. This just indicates that we are printing some normal text, technically called a 'string'. Other things, like integers, for example, can also be printed.
#Input information
player1 = input("What is your name, Player 1? ")
player1_points = 0
player2 = input("What is your name, Player 2? ")
player2_points = 0 
length_per_round = int(input("How many seconds do you want each round to be? "))
highest_num = int(input("What's the largest number you want to multiply? "))
Before launching into questions, the game collects a little bit of information about who's playing. The first thing we do is create 'player1' by asking the game players to input the name of the first player. Whatever the player types in will be stored inside of the variable 'player1'. We need to keep track of how many points the first person has, so we'll create another variable called 'player1_points'. Since the game hasn't started, the first player currently has 0 points. The words on the left hand side of the equals sign are called variables because their values can change over the course of the program. When the first player correctly answers a question, they'll gain a point and we'll overwrite the score stored in 'player1_points'. We'll see how this works later. Variables are a pretty complicated concept for 4th graders. The idea was one of the biggest difficulties while teaching this course. If anyone has difficulty understanding what's going on here, I highly recommend checking out 'Automate the Boring Stuff with Python'.

We then repeat the process for the second player, and collect their information. Lastly, we need to know how long we want each round to last and what the largest number is that we feel comfortable multiplying. Maybe we want to go easy and play a game for 30 seconds only multiplying up to 10. Or we could go for something more challenging and give ourselves a minute to multiply numbers up to 20. Whatever you want to do. Notice the 'int' surrounding the 'input' on the last two lines. 'int' stands for 'integer', and we use this because we want Python to treat our entries in these lines like numbers instead of 'strings'.
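To see why the 'int' matters, here is a small sketch. The first two lines show how Python treats the text "3" differently from the number 3. The helper `ask_for_int` is hypothetical and not part of the game above; it assumes you would rather ask again than crash when someone types a word instead of a number.

```python
print("3" * 4)       # a string repeated four times: 3333
print(int("3") * 4)  # a number multiplied by four: 12


def ask_for_int(prompt, ask=input):
    """Keep asking until the player types a whole number.

    `ask` defaults to the built-in input; it is a parameter only so the
    helper can be tried out without a keyboard.
    """
    while True:
        text = ask(prompt)
        try:
            return int(text)
        except ValueError:
            # int() raises ValueError for text like "ten" or an empty line.
            print("Please type a whole number, like 12.")
```

In the game, a line such as `highest_num = ask_for_int("What's the largest number you want to multiply? ")` could stand in for the `int(input(...))` version.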
#Launch Player1's game
print("")
input("{}! Hit Enter to start your time.".format(player1))
current_time = time.time()
end_time = current_time + length_per_round
Line number 3 is a nice way to pause the game until the first player is ready. But what's going on with the '{}'? It's just a placeholder in the text. When it is printed to the console, the first player's name will be displayed in the sentence. So, if I were playing, for example, the prompt would read, 'Jessime! Hit Enter to start your time.' As soon as I hit enter, the next line of code would execute.
The next line is where one of our imports comes in handy. We're just going to access the computer's clock to store the current time. We need to do this because we want the player to answer questions for 30 seconds, or however long was chosen. So we need to know what time the player started. We then calculate when the player's time is up (by adding the current time and how long we want to play), and store it in the 'end_time' variable.
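If you want to poke at these two ideas on their own, here is a small sketch. The 30 seconds and the name are example values, and `seconds_left` is a made-up variable that the game itself does not use.

```python
import time

length_per_round = 30  # example value; the game asks the players for this

# time.time() returns the number of seconds since a fixed starting point.
current_time = time.time()
end_time = current_time + length_per_round

# How long until the round would be over (a hypothetical extra, for display).
seconds_left = end_time - time.time()

# Each {} is filled in, left to right, by the values passed to format().
print("{}! You have about {:.0f} seconds.".format("Jessime", seconds_left))
```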
while current_time <= end_time:
    
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
    answer = int(input("What is {}x{}? ".format(a,b)))
This is where the first player begins getting questions. The first line is what's known as a while loop. Loops are another tough subject; use the link and other resources on Google to get more information if necessary. What's happening here is quite straightforward though. We're telling Python 'Hey, execute this next bit of code over and over until the first player runs out of time'. The 'next bit of code' means all of the lines which start 4 spaces over. The code that we're going to execute over and over is:

  1. Generate a question
  2. Ask the player the question
  3. Give the player a point if they answer correctly
  4. Tell the player the correct answer if they miss the question
  5. Update the time and see if the 30 seconds is up or not
These five steps will be repeated. Lines 4-7 in the code above knock out the first two steps. We create 'a' and 'b' by using the 'random' library to generate a random integer in the range of 0 to our highest number (12, for example). The correct solution to the question will be 'a' times 'b', which we store as 'c'. Then we ask the question to the player and store their guess as 'answer'.
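One detail worth trying for yourself: `random.randint` includes both ends of the range, so with a highest number of 12 you can get 0 and you can get 12. A minimal sketch, where `make_question` is a hypothetical helper that just wraps the three lines from the game:

```python
import random


def make_question(highest_num):
    """Return two random numbers and their product."""
    a = random.randint(0, highest_num)  # 0 and highest_num are both possible
    b = random.randint(0, highest_num)
    c = a * b
    return a, b, c


# Print a few sample questions without asking anyone to answer them.
for _ in range(3):
    a, b, c = make_question(12)
    print("What is {}x{}? (answer: {})".format(a, b, c))
```

If you would rather never ask about multiplying by zero, changing the first argument of `randint` from 0 to 1 is enough.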

    #Check if right or wrong
    if answer == c:
        print("You're Correct!")
        player1_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
At this point, one of two things can happen. Either the player is right, or they're wrong. Line two says, if the answer the player gave is equal to the correct answer, execute the next bit of code. Again, the 'next bit of code' means all of the lines that are indented 4 spaces over. So, if the player gets the question correct, we're going to do two things. We print to the console letting the player know they answered correctly. We then increment 'player1_points' by one. That is, if the question is the first correctly answered, 'player1_points' goes from 0 to 1. If the first player has 4 points already, then 'player1_points' will now be equal to 5. If the player did not correctly answer the question, we print the correct equation.

    #Must update the time at the end of each loop.
    current_time = time.time()
This line is absolutely key to the while loop. It's the last line of the while loop, which means, since we're in a loop, we're about to jump back up to line 22 (of the whole code). Once we are back at 22, we're going to reevaluate if we have any time left. The only way that the 'current_time' variable will be accurate is if we update it by 'current_time = time.time()'. If we don't have this line, the while condition will always be true, you'll enter an infinite loop, and you'll have to shut down Python manually. Make sure to have this line.

Once the current time is greater than the end time, we'll exit out of the while loop, and the first player's turn will be over.
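Here is the same loop boiled down so you can watch it stop on its own. This is only a sketch: the round lasts 0.3 seconds, `time.sleep` stands in for the player thinking about an answer, and the `loops` counter is made up for the demonstration.

```python
import time

current_time = time.time()
end_time = current_time + 0.3  # a very short "round"
loops = 0

while current_time <= end_time:
    time.sleep(0.1)             # pretend the player is answering a question
    loops += 1
    current_time = time.time()  # without this line the loop never ends

print("The loop ran {} times.".format(loops))
```

Notice that the clock is only checked between questions. Because `input` waits for the player, a question that is already on the screen when time runs out can still be answered.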
print("You scored {} points!".format(player1_points))
We can let the first player know how many points they scored. It may not look like it from the line count, but at this point, we're pretty much finished. A majority of the rest of the code is a duplicate of what we just did, but for the second player.
#Get ready for Player2
print("")
input("{}! Hit Enter to start your time.".format(player2))
current_time = time.time()
end_time = current_time + length_per_round
while current_time <= end_time:
  
    #Do math
    a = random.randint(0, highest_num)
    b = random.randint(0, highest_num)
    c = a*b
  
    #Check if right or wrong
    answer = int(input("What is {}x{}? ".format(a,b)))
    if answer == c:
        print("You're Correct!")
        player2_points += 1
    else:
        print("Sorry. {}x{}={}.".format(a,b,c))
    
    #Must update the time at the end of each loop.
    current_time = time.time()
print("You scored {} points!".format(player2_points))
Literally the only change here is that I've replaced 'player1' with 'player2'. Now that each player has had their turn, the only thing left to do is figure out who the winner is.
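Since the two rounds are identical apart from the player, one possible next exercise is to write the round once as a function and call it twice. This is a sketch of that refactoring, not how the class version was written: `play_round` and its `ask` parameter are hypothetical names, and `ask` exists only so the function can be tried without a keyboard.

```python
import random
import time


def play_round(player, length_per_round, highest_num, ask=input):
    """Run one timed round for `player` and return the points scored."""
    points = 0
    ask("{}! Hit Enter to start your time.".format(player))
    end_time = time.time() + length_per_round

    # Checking time.time() directly means there is no variable to forget
    # to update at the end of the loop.
    while time.time() <= end_time:
        a = random.randint(0, highest_num)
        b = random.randint(0, highest_num)
        c = a * b
        answer = int(ask("What is {}x{}? ".format(a, b)))
        if answer == c:
            print("You're Correct!")
            points += 1
        else:
            print("Sorry. {}x{}={}.".format(a, b, c))

    print("You scored {} points!".format(points))
    return points
```

The two long blocks in the game would then shrink to `player1_points = play_round(player1, length_per_round, highest_num)` and the same line for 'player2'.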
#Decide a winner
print("")
if player1_points > player2_points:
    print("{} wins! Good job!".format(player1))
elif player2_points > player1_points:
    print("{} wins! Good job!".format(player2))
else:
    print("It was a tie! You two should play again!")
This block of code should look very similar to the part where we decided if the players correctly answered the question or not. The only difference is that there are now three things that can happen:

  1. Player 1 can win.
  2. Player 2 can win.
  3. The players can tie.
Only one of these three statements will print, and we make the decision by comparing 'player1_points' to 'player2_points' and responding appropriately.
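To convince yourself that exactly one branch runs, you can wrap the comparison in a function and try all three cases. `decide_winner` is a hypothetical helper, and the names and scores below are made up.

```python
def decide_winner(player1, player1_points, player2, player2_points):
    """Return the message for whichever of the three outcomes happened."""
    if player1_points > player2_points:
        return "{} wins! Good job!".format(player1)
    elif player2_points > player1_points:
        return "{} wins! Good job!".format(player2)
    else:
        # Reached only when the two scores are equal.
        return "It was a tie! You two should play again!"


print(decide_winner("Ada", 7, "Grace", 5))
print(decide_winner("Ada", 3, "Grace", 5))
print(decide_winner("Ada", 4, "Grace", 4))
```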

And that's it! Once the script executes this last line, it will automatically end and the game is over.
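If you want to try the division exercise suggested at the top of this post, one way to keep the answers whole numbers is to build the question backwards: pick the answer first, then multiply to get the number being divided. This is only a sketch of one approach, and `make_division_question` is a hypothetical name.

```python
import random


def make_division_question(highest_num):
    """Return (dividend, divisor, quotient) with a whole-number answer."""
    divisor = random.randint(1, highest_num)   # start at 1: never divide by 0
    quotient = random.randint(0, highest_num)  # this is the correct answer
    dividend = divisor * quotient              # guarantees an exact division
    return dividend, divisor, quotient


a, b, c = make_division_question(12)
print("What is {}/{}? (answer: {})".format(a, b, c))
```

Inside the game loop, you would compare the player's answer to the quotient instead of to 'a' times 'b'.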

Teach Your Fourth Grader Python (Installing Anaconda)

It's really never too early to start teaching your child to code. You may think, "What's the point of teaching my kid to code so long before they have any idea what they want to do with their life?" The topic is worth a post of its own, but here are a couple of articles by Beth Werrell and Dan Crow that address how useful "computational thinking" is in a world where everything relies on software. Another possible issue that might come to mind is, "How do I teach my child to code when I don't even know how?" Thankfully, there are already some really great resources out there to help you learn and teach at the same time. One of my favorite books is Invent Your Own Computer Games with Python, which is free online. Another resource, which I haven't used at all but which generally gets good reviews, is Teach Your Kids to Code: A Parent-Friendly Guide to Python Programming.

I had the opportunity earlier this semester to team up with a friend of mine to teach a few hours' worth of Python to her 4th grade class. Let me point out that this isn't nearly enough time to learn a programming language. It's enough time to see what coding is and complete a small project to get the students interested in coding. In 4th grade, kids are still learning/perfecting their multiplication tables, so we decided to make a 2-player multiplication game.

We'll go over two things. First, I'll show you how to easily install Python (it can be a little bit of a challenge on your own). Then, in the next post, I'll present the code that I wrote for class, and describe the basics of what it does. Again, this isn't going to provide enough information for you to "learn how to code". It's a fun first project to show you why learning to code is worthwhile. I would suggest using one of the resources I linked above to figure out all of the details about how the code works.

Installing Anaconda

Anaconda is a version of Python nicely bundled into a convenient package. Here is a list of steps to get up and running:

1. Go to https://www.continuum.io/downloads to download the program. 
2. You'll see options for Windows, OSX, and Linux. Choose the one that works for you.
3. You'll also have the option between Python 2.7 and 3.5. You should choose 3.5 unless you have a very specific reason for choosing 2.7.


4. Begin the download by clicking the 3.5 Graphical Installer I've highlighted in the red rectangle above.
5. Once the download is complete, open the installer.


6. Just follow the defaults and walk through the installer. 
7. Agree to the Terms of Service. 
8. Install for Just me.
9. Accept the default 'Destination Folder'
10. Click the 'Install' button.
11. Once the install has finished, search for 'Spyder' in your computer's search. Look for an icon similar to the one below:


12. Note: Ignore that my icon says Python 2.7; it's what I use because of legacy issues.
13. Select the icon to launch Spyder. It will take a minute to launch. 


14. Congratulations! You're ready to start coding. 

In case you're wondering, Spyder is what's known as an IDE. At its simplest, it is just a place to write and run your code. If yours doesn't look exactly the same as mine, don't worry. It's going to look different depending on whether you're using Windows, OSX or Linux.
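Once Spyder is open, a quick way to check that everything works is to type a couple of lines into the editor and run them. This is just a suggested first test, not part of the installer's steps.

```python
import sys

# Shows which version of Python Spyder is running.
# If you chose the 3.5 installer, the text should begin with 3.5.
print(sys.version)

# A first line of code to confirm that printing works.
print("Hello, Players!")
```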