Programming Philosophy Links
- Software Is Hard
- Maker's Schedule, Manager's Schedule
- programming | Quotes Archive
- How to be a Programmer: A Short, Comprehensive, and Personal Summary
- The "free electron" programmer
- Software Carpentry
- Edited Contributions - Programmer 97-things
- It's OK Not to Write Unit Tests
- Psychology and Security Resource Page
- How To Make Life Suck Less (While Making Scalable Systems)
- The myth of “undesigned”
- The Problem with Design and Implementation
Machine Learning Links
- Feature Column from the AMS on singular value decomposition
- The Geomblog: Correlation Clustering: I don't like you, but I like them...
- linkiblog | How to Build a Popularity Algorithm You can be Proud of
- Semisupervised Learning Approaches
- Gaussian Processes for Machine Learning: Book webpage
- pybrain's pybrain at master - GitHub
- Introduction to Statistical Thought
- Netflix prize tribute: Recommendation algorithm in Python | This Number Crunching Life
How to Deploy hgwebdir.fcgi behind Nginx with Fab
If you're managing multiple mercurial repositories, it's nice to see them all in one place, using a simple web-based repository browser. There are various ways to publish mercurial repositories, but hgwebdir is the only method that supports multiple repos. Since I prefer fastcgi and nginx, I decided to use hgwebdir.fcgi, which unfortunately isn't documented on the mercurial wiki.
hgweb.config
Let's start by creating hgweb.config, which tells hgwebdir where the repos are and what the web UI should look like.
[paths]
/REPO1NAME = /PATH/TO/REPO1
/REPO2NAME = /PATH/TO/REPO2

[web]
base =
style = monoblue
There are a few included themes to choose from; I like the monoblue style. The empty base= line is apparently required to make everything work.
hgwebdir.fcgi
Next, make a copy of hgwebdir.fcgi, which in Ubuntu can be found in /usr/share/doc/mercurial/examples. Below is a simplified version with all comments removed. The one line you may want to change is the path to hgweb.config on the server, but I'll assume you'll want it in /etc/mercurial.
from mercurial import demandimport; demandimport.enable()
from mercurial.hgweb.hgwebdir_mod import hgwebdir
from mercurial.hgweb.request import wsgiapplication
from flup.server.fcgi import WSGIServer
def make_web_app():
    return hgwebdir("/etc/mercurial/hgweb.config")
WSGIServer(wsgiapplication(make_web_app)).run()
hg_server.conf
This is a simple nginx fastcgi config you can modify for your own purposes. It forwards all requests for hg.DOMAIN.COM to the hgwebdir.fcgi socket we'll be starting below.
server {
    listen 80;
    server_name hg;
    server_name hg.DOMAIN.COM;
    access_log /var/log/hg_access.log;
    error_log /var/log/hg_error.log;
    location / {
        fastcgi_pass unix:/var/run/hgwebdir.sock;
        fastcgi_param PATH_INFO $fastcgi_script_name;
        fastcgi_param QUERY_STRING $query_string;
        fastcgi_param REQUEST_METHOD $request_method;
        fastcgi_param CONTENT_TYPE $content_type;
        fastcgi_param CONTENT_LENGTH $content_length;
        fastcgi_param SERVER_PROTOCOL $server_protocol;
        fastcgi_param SERVER_PORT $server_port;
        fastcgi_param SERVER_NAME $server_name;
    }
}
The hg_server.conf file will need a link from /etc/nginx/sites-enabled to its location in /etc/nginx/sites-available, assuming that you're using the default nginx config, which includes every server conf found in /etc/nginx/sites-enabled.
fab hgweb restart_nginx
To make deployment easy, I use fab, so that if I make any changes to hgweb.config or hg_server.conf, I can simply run fab hgweb restart_nginx. For starting hgwebdir.fcgi, we can use spawn-fcgi, which usually comes with lighttpd, so you'll need that installed too.
hgweb copies hgweb.config and hgwebdir.fcgi to appropriate locations on the server, then starts the fastcgi process with a socket at /var/run/hgwebdir.sock.
restart_nginx copies hg_server.conf to the server and tells nginx to reload its config.
from fabric.api import env, put, sudo

def hgweb():
    env.runpath = '/var/run'
    # upload to /tmp first, then move into place with sudo
    put('hgweb.config', '/tmp')
    put('hgwebdir.fcgi', '/tmp')
    sudo('mv /tmp/hgwebdir.fcgi /usr/local/bin/')
    sudo('chmod +x /usr/local/bin/hgwebdir.fcgi')
    sudo('mv /tmp/hgweb.config /etc/mercurial/hgweb.config')
    # stop the running fastcgi process, then start a fresh one as www-data
    sudo('kill `cat %s/hgwebdir.pid`' % env.runpath)
    sudo('spawn-fcgi -f /usr/local/bin/hgwebdir.fcgi -s %s/hgwebdir.sock -P %s/hgwebdir.pid' % (env.runpath, env.runpath), user='www-data')

def restart_nginx():
    put('hg_server.conf', '/tmp/')
    sudo('mv /tmp/hg_server.conf /etc/nginx/sites-available/')
    # HUP tells nginx to reload its config
    sudo('killall -HUP nginx')
Once you've got these commands in your fabfile.py, you can run fab hgweb restart_nginx to deploy.
hgrc
Now that you've got hgwebdir.fcgi running (you can make sure it works by going to http://hg.DOMAIN.COM), you'll probably want to customize the info about each repo by editing .hg/hgrc.
[web]
description = All about my repo
contact = Me
And that's it! You should now have a fast web-based browser for multiple repos.
mapfilter
Have you ever mapped a list, then filtered it? Or filtered first, then mapped? Why not do it all in one pass with mapfilter?
mapfilter?
mapfilter is a function that combines the traditional map & filter of functional programming by using the following logic:
- if your function returns false, then the element is discarded
- any other return value is mapped into the list
Why?
Doing a map and then a filter makes two passes over the list, calling a function on each element up to 2N times, whereas mapfilter makes one pass with N calls. That can be up to twice as fast! If you are dealing with large lists, this can be a huge time saver. And for the case where a large list contains small IDs for looking up a larger data structure, using mapfilter can cut the number of database lookups by up to half.
Obviously, mapfilter won't work if you want to produce a list of boolean values, as it would filter out all the false values. But why would you want to map to a list of booleans?
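To make the lookup savings concrete, here's a minimal Python sketch of the same idea. The USERS dictionary, the fetch function, and the helper names are all made up for illustration; they stand in for a real database and are not part of any library. Like the logic described above, only a result that is exactly False is discarded.

```python
def mapfilter(func, items):
    """Map func over items in one pass, dropping results that are exactly False."""
    results = []
    for item in items:
        value = func(item)
        if value is not False:
            results.append(value)
    return results

# Hypothetical in-memory "database" keyed by small IDs.
USERS = {
    1: {'name': 'ann', 'active': True},
    2: {'name': 'bob', 'active': False},
    3: {'name': 'cat', 'active': True},
}
lookups = []  # records every ID that gets looked up

def fetch(user_id):
    lookups.append(user_id)
    return USERS.get(user_id)

def is_active(user_id):
    user = fetch(user_id)
    return bool(user and user['active'])

def name_of(user_id):
    return fetch(user_id)['name']

def active_name(user_id):
    # one lookup decides both whether to keep the ID and what to map it to
    user = fetch(user_id)
    return user['name'] if user and user['active'] else False

ids = [1, 2, 3, 4]

# filter, then map: every ID is looked up once, every survivor a second time
two_pass = list(map(name_of, filter(is_active, ids)))
two_pass_lookups = len(lookups)

del lookups[:]

# mapfilter: every ID is looked up exactly once
one_pass = mapfilter(active_name, ids)
one_pass_lookups = len(lookups)
```

With these four IDs the two-pass version does six lookups and mapfilter does four. The closer the filter comes to keeping everything, the closer the savings get to half.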
Erlang Code
Here's some erlang code I've been using for a while:
mapfilter(F, List) -> lists:reverse(mapfilter(F, List, [])).

mapfilter(_, [], Results) -> Results;
mapfilter(F, [Item | Rest], Results) ->
    case F(Item) of
        false -> mapfilter(F, Rest, Results);
        Term -> mapfilter(F, Rest, [Term | Results])
    end.
Has anyone else done this for themselves? Does mapfilter exist in any programming language? If so, please leave a comment. I think mapfilter is a very simple & useful concept that should be included in the standard library of every (functional) programming language. Erlang already has mapfoldl (map-reduce in one pass), so why not also have mapfilter?
Chunk Extraction with NLTK
Chunk extraction is a useful preliminary step to information extraction that creates parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.
Chunking is basically a 3 step process:
- Tag a sentence
- Chunk the tagged sentence
- Analyze the parse tree to extract information
I've already written about how to train a part of speech tagger and a chunker, so I'll assume you've already done the training, and now you want to use your tagger and chunker to do something useful.
Tag Chunker
The previously trained chunker is actually a chunk tagger. It's a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I've created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.
import nltk.chunk
import itertools
class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
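To get a feel for what converting IOB tags into chunks involves, here's a rough sketch in plain Python with no NLTK dependency. The iob_to_chunks helper and the sample sentence are made up for illustration; this is not how conllstr2tree is implemented, and it produces a flat list of chunks rather than a tree. The assumption is the usual IOB convention: B- begins a chunk, I- continues it, and O is outside any chunk.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_type, [(word, pos), ...]) pairs."""
    chunks = []
    current = None  # the chunk currently being extended, if any
    for word, pos, iob in triples:
        if iob == 'O':
            # outside any chunk: close whatever was open
            current = None
            continue
        prefix, chunk_type = iob.split('-', 1)
        # start a new chunk on B-, or on an I- that doesn't continue the open chunk
        if prefix == 'B' or current is None or current[0] != chunk_type:
            current = (chunk_type, [])
            chunks.append(current)
        current[1].append((word, pos))
    return chunks

# hand-tagged sample sentence: (word, part-of-speech tag, IOB chunk tag)
sentence = [
    ('the', 'DT', 'B-NP'),
    ('little', 'JJ', 'I-NP'),
    ('dog', 'NN', 'I-NP'),
    ('barked', 'VBD', 'B-VP'),
    ('at', 'IN', 'B-PP'),
    ('the', 'DT', 'B-NP'),
    ('cat', 'NN', 'I-NP'),
]

# keep only the noun phrases, each as a list of tagged words
noun_phrases = [words for chunk_type, words in iob_to_chunks(sentence)
                if chunk_type == 'NP']
```

Each noun phrase comes out as a list of part-of-speech tagged words, which is the same shape you get from the leaves of an NP sub tree.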
Chunk Extraction
Now that we have a proper chunker, we can use it to extract chunks. Here's a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.
# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()
Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we're training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn't similar to your training data, then you probably won't be getting many chunks.
Test Driven Development in Python
One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate, just write a function call and the expected output in the docstring. Here's a quick example of a fibonacci function.
def fib(n):
    '''Return the nth fibonacci number.
    >>> fib(0)
    0
    >>> fib(1)
    1
    >>> fib(2)
    1
    >>> fib(3)
    2
    >>> fib(4)
    3
    '''
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)
If you want to run your doctests, just add the following three lines to the bottom of your module.
if __name__ == '__main__':
    import doctest
    doctest.testmod()
Now you can run your module to run the doctests, like python fib.py.
So how well does this fit in with the TDD philosophy? Here are the basic TDD practices.
1. Think about what you want to test
2. Write a small test
3. Write just enough code to fail the test
4. Run the test and watch it fail
5. Write just enough code to pass the test
6. Run the test and watch it pass (if it fails, go back to step 5)
7. Go back to step 1 and repeat until done
And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.
1. Define a new empty method
def fib(n):
    '''Return the nth fibonacci number.'''
    pass

if __name__ == '__main__':
    import doctest
    doctest.testmod()
2. Write a doctest
def fib(n):
    '''Return the nth fibonacci number.
    >>> fib(0)
    0
    '''
    pass
3. Run the module and watch the doctest fail
python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.
4. Write just enough code to pass the failing doctest
def fib(n):
    '''Return the nth fibonacci number.
    >>> fib(0)
    0
    '''
    return 0
5. Run the module and watch the doctest pass
python fib.py
6. Go back to step 2 and repeat
Now you can start filling in the rest of the function, one test at a time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.
Unit Tests
Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you'll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.
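For comparison, here's a rough sketch of what a more traditional unit test might look like for the fib function from earlier, using the standard library's unittest module. The FibTest class, its method names, and the expected values kept in setUp are just illustrative; in a real project the setUp method is where external data or mock objects would be prepared.

```python
import unittest

def fib(n):
    '''Return the nth fibonacci number.'''
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)

class FibTest(unittest.TestCase):
    def setUp(self):
        # shared fixture, rebuilt before every test method
        self.expected = [0, 1, 1, 2, 3, 5, 8]

    def test_known_values(self):
        for n, value in enumerate(self.expected):
            self.assertEqual(fib(n), value)

    def test_recurrence(self):
        # each number is the sum of the two before it
        self.assertEqual(fib(10), fib(9) + fib(8))
```

Saved in a test module, this can be run with python -m unittest, and nose will also pick up unittest.TestCase subclasses.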
Running Tests
For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg
[nosetests]
verbosity=3
with-doctest=1
Then in my Makefile, I add a test command so I can run make test.
test:
	@nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2
PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.
And finally, if you're looking for a continuous integration server, try Buildbot.
Programming as Design
Some say programming is engineering, others call it an art. A few might (mistakenly) think it's a science. But both the art and engineering can be encapsulated under the umbrella of design. The best design is functional art, and a huge part of the artistic beauty of a product is a result of carefully engineered functionality. Products that are not carefully designed and engineered generally suck to use, and that applies to everything from cameras to software APIs.
Design Principles
There are 4 major principles of graphic design.
Alignment
The principle of alignment is that everything on a page should be connected to something else on the page. The goal of alignment in graphic design is to create visual associations, often using a grid based layout. In software, we can apply this principle to the connectedness of data, such as the object inheritance hierarchy and relational data structures. Ideally, all your objects fit nicely into a well-defined hierarchy and your data structures relate to each other in an intuitive fashion. Of course, the real world of programming is never as clean as you'd like, but keep this principle in mind whenever you create a new object, add a new dependency, or modify relational structures.
- Does the object cleanly fit within the existing hierarchy? If not, do you need to change the new object, or re-align the hierarchy?
- How does this data structure relate to that other data structure? What will happen if the relations change? Will you need to re-align the relational structures?
Keeping your objects and data structures neatly aligned will result in easier to understand relations and hierarchies.
Proximity
The principle of proximity is that related items should be grouped together. Grouping things together is a simple way to show relatedness. In software development, that generally means putting related functions into the same module, and related modules into the same package. Helper functions should be located near the functions that call them. Basically, group blocks of code in a logically consistent manner. And if possible, put documentation and tests close to the code too (Python's doctest module provides a great way to do that). One of the major benefits of following this principle is that it reduces the amount of time you'll spend searching through and understanding your own code. If all your code is organized in logical groups, and related functions are near each other in the same file, then it's much easier to find a particular block of code.
Contrast
The principle of contrast is that if two things are not the same, then make them very different. The goal with contrast is to make different things distinctive from each other. Naming is a great place to apply this principle. Names are all you have to distinguish between objects, functions, variables, and modules, so make sure that your names are distinctive and descriptive. Good names can tell you exactly what something is, and even imply its properties and behavior. Use different naming styles for different types of things. Private variables could be prefixed with an underscore, like _private, versus public variables like public, and CamelCase class names, as in MyClassName. Having a distinctive naming style lets you know at a glance whether something is a class, variable, or function, making your code much more readable. Whatever naming style you choose, use it consistently.
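As a small illustration of those naming styles side by side, here's a sketch in Python. The ShoppingCart class and everything in it is invented for the example, and the lowercase_with_underscores style for methods is an assumption borrowed from common Python convention rather than something prescribed above.

```python
class ShoppingCart(object):
    """CamelCase: at a glance, this is a class."""

    def __init__(self):
        self.items = []             # plain name: public, safe to read from outside
        self._cached_total = None   # leading underscore: private detail

    def add_item(self, name, price):
        """lowercase_with_underscores: a function or method."""
        self.items.append((name, price))
        self._cached_total = None   # invalidate the private cache

    def total(self):
        # recompute only when the private cache has been invalidated
        if self._cached_total is None:
            self._cached_total = sum(price for _, price in self.items)
        return self._cached_total

cart = ShoppingCart()
cart.add_item('book', 10)
cart.add_item('pen', 2)
```

Even without reading the bodies, the names say which things are part of the interface (items, add_item, total) and which are implementation details (_cached_total).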
Consistency
The principle of consistency, or repetition, is that you repeat design elements. Repetition helps patterns become internalized and instantly recognizable. For programming, that means keeping a consistent code style, with consistent naming practices. Also, try to use well known standard conventions and protocols, shared libraries, and design patterns. Your code should make sense, or at least be readable, to those familiar with the language and domain. You're not just writing code for yourself, you might be writing code for other programmers, maybe your manager, but most importantly, you're writing code for your future self. It always sucks coming back to code you haven't touched in months and not knowing what the hell is going on. Consistent style and software design can save you from that headache.
Programming as Design
If all this seems like obvious common sense to you, then great! But common sense isn't always so common. The point of this article is to make you aware that everyday software programming is filled with design choices. Naming a variable is a design choice. Creating a new module is a design choice. The layout of your working directory is a design choice. Be conscious of these choices and use the above principles to inform your decisions. The choices you make communicate how the software works and how the code fits together. Make every choice deliberate and justifiable. Use refactoring to improve the design without affecting the functionality.




