NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.
The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1. fd.samples() doesn’t exist anymore; instead, you can use fd.most_common(), a method of collections.Counter that returns a list of (sample, count) pairs, sorted by decreasing count.
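Since FreqDist now inherits from collections.Counter, the new idioms can be shown with Counter itself; a minimal sketch (the tag values are made up for illustration):

```python
from collections import Counter

# In NLTK 3, FreqDist is a subclass of collections.Counter,
# so the Counter idioms below apply to FreqDist directly.
fd = Counter()
for tag in ['NN', 'VB', 'NN', 'DT', 'NN']:
    fd[tag] += 1          # NLTK 2 equivalent: fd.inc(tag)

# most_common() replaces samples(): (sample, count) pairs, most frequent first
print(fd.most_common())   # [('NN', 3), ('VB', 1), ('DT', 1)]
```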
NLTK 3 has changed many wordnet Synset attributes to methods: for example, synset.name is now synset.name(), and synset.definition is now synset.definition(). The same goes for the Lemma class; for example, lemma.antonyms() is now a method.
Tagging

The batch_tag() method is now tag_sents(). The brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface, with brill.Pos and brill.Word as implementations of the interface.
Universal Tagset

Simplified tags have been replaced with the universal tagset. So instead of tagged_corpus.tagged_sents(simplify_tags=True), you now do tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, a TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in the universal_tagset data package: the treebank tagset is called en-ptb (Penn Treebank) and the brown tagset is called en-brown. These files are simply 2-column, tab-separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source_tag) is used to perform the mapping.
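Because the mapping files are plain two-column, tab-separated text, the format can be sketched in a few lines of Python. The file contents below are an inline stand-in, not a real en-ptb file, though NN, VB, and DT do map to NOUN, VERB, and DET in the universal tagset:

```python
# Hypothetical contents of a 2-column, tab-separated tagset mapping file:
# each line maps a source tag to its universal tag.
mapping_text = "NN\tNOUN\nVB\tVERB\nDT\tDET\n"

mapping = {}
for line in mapping_text.splitlines():
    source, universal = line.split('\t')
    mapping[source] = universal

print(mapping['NN'])  # NOUN
```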
Chunking & Parse Trees
The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.
Classification

The SVM classifiers and scipy-based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because hash randomization makes dictionary ordering unpredictable in Python 3. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.
Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.
Here are a few other Python 3 changes I ran into:

dict.iteritems() doesn’t exist anymore; use dict.items() instead.

dict.keys() does not produce a list (it returns a view). If you want a list, use list(dict.keys()).
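Both changes can be seen on a small dictionary (the keys here are made up for illustration):

```python
d = {'a': 1, 'b': 2}

# Python 2's dict.iteritems() is gone; dict.items() now returns a lazy view
for key, value in d.items():
    pass

# dict.keys() is also a view, not a list; wrap it when list behavior is needed
keys = list(d.keys())
print(keys)  # ['a', 'b'] (insertion order in Python 3.7+)
```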
Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.