Win a free book at the February Python Book Contest

This month is a special month. It’s not because of Valentine’s Day, or even the exciting day when we get to see groundhogs. No, this month is special because I’m having a book contest where you, the reader, can win a free book by doing nothing more than posting a comment saying that you want one of the several books available in the contest.

So, without getting into boring details, I’ll keep this short. I’ve been reviewing a lot of books lately and I think it’s time to get some of them into readers’ hands. This month the giveaways are all Python oriented.

So, all you have to do is take a look at the following titles and post a comment here saying that you want one of them. At the end of the month two readers will be chosen via a random list-sorting Python script I’ve whipped up for just this purpose. The winners will then get an email from the publisher, who will send a brand new e-copy of the book free of charge. I’ll also be reviewing these books at a later date for those who do not win the contest.



Python Text Processing with NLTK 2.0 Cookbook


Python 2.6 Text Processing: Beginners Guide


Python 2.6 Graphics Cookbook

Post a comment now and tell me which book you want!

Python for Automation: using pdsh for a menu-driven command execution environment

I’ve been playing around with some quick system automation scripts that are handy when you don’t want or need to set up something like Chef or Puppet. I like to keep all of my hostnames and login details in a MySQL database (a CMDB, actually), but for this example we’ll just use a couple of nested lists; there’s a sketch after the script of what a database-backed lookup might look like. This script executes commands in parallel across the hosts you choose in the menu system via the “pdsh” command, so make sure you have that installed before running it. Alternatively, you can change the command call to use ssh instead of pdsh for serialized execution, but that’s not as fun or as fast. With some customization here and there you can expand this to run parallelized jobs for daily work in database administration, usage reporting, log file parsing, or other system automation as you see fit. Here’s the code. Comments welcome as always!

#!/usr/bin/env python
## NAME: menu_parallel_execution.py
## DATE: 2011-02-04
## AUTHOR: Matt Reid
## WEBSITE: http://kontrollsoft.com
## EMAIL: mreid@kontrollsoft.com
## LICENSE: BSD http://www.opensource.org/licenses/bsd-license.php

import sys
import operator

ver = sys.version_info
version = "%d.%d"%(ver[0],ver[1])
if version in ('2.0','2.1','2.2','2.3','2.4'):
        pyver = "old"
else:
        from subprocess import call
        pyver = "new"

def main():
        #This section of host definitions could be replaced by a database 
        #call if you are storing your hosts in that manner, but we'll keep 
        #this script simple for the time being.
	dc0 = [
		["host0-0","10.0.0.1","ssh-username"],
		["host0-1","10.0.0.2","ss-username"]]

	dc1 = [
		["host1-0","10.1.0.1","ssh-username"],
		["host1-1","10.1.0.2","ss-username"]]

	dc2 = [
		["host2-0","10.2.0.1","ssh-username"],
		["host2-1","10.2.0.2","ss-username"]]

	print '''[1] Datacenter-1
[2] Datacenter-2
[3] Datacenter-3
'''
	dc = int(raw_input("Datacenter ID: "))
        if dc == 1:
                hosts = dc0
        elif dc == 2:
                hosts = dc1
        elif dc == 3:
                hosts = dc2
        else:
                print "Invalid datacenter ID"
                sys.exit(1)
		
	nodes = []
	stay = True
        while stay:
                nest = sorted(hosts, key=operator.itemgetter(0))
                for i in range(len(nest)):
                        print "[%i] %s | %s"%(i,nest[i][1],nest[i][0])

		print "\nSelect node to add to execution list"
		ch = int(raw_input("ID: "))
		xx = [nest[ch][1],nest[ch][2]]
		nodes.append(xx)
		s = str(raw_input("\nAdd another node? [Y/n] "))
		if s == "n" or s == "N":
			stay = False

	if(pyver == "new"):
		addrs = ""
		for node in nodes:
			address = node[1]+"@"+node[0]+","
			address = address.rstrip("\n")
			addrs = addrs+address
		
		addrs = addrs.strip(",")
		cmd = str(raw_input("\nEnter the command to execute: "))				

                c = "pdsh -w %s %s"%(addrs,cmd)
                print "Executing: %s"%(c)
                try:
                        call(c,shell=True)
                except OSError:
                        print "Failed to execute pdsh command: %s"%(c)
                        sys.exit(1)

		sys.exit(0)

	if(pyver == "old"):
		print "Please upgrade to Python 2.6+"
		sys.exit(1)



## START
if __name__ == "__main__":
	try:
		retval = main()
	except (KeyboardInterrupt, SystemExit):
		sys.exit(1)
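
As mentioned above, the nested host lists are just a stand-in for a real CMDB. Here’s a rough sketch of what a database-backed lookup might look like; the cmdb_hosts table, its columns, and the connection details are hypothetical placeholders, so adjust them to your own schema:

import MySQLdb

def get_hosts(dc_id):
    #the table and column names below are illustrative placeholders
    db = MySQLdb.connect(host="localhost", user="root",
                         passwd="password", db="cmdb")
    cursor = db.cursor()
    cursor.execute("SELECT hostname, ip_address, ssh_user "
                   "FROM cmdb_hosts WHERE datacenter_id = %s", (dc_id,))
    hosts = [list(row) for row in cursor.fetchall()]
    cursor.close()
    db.close()
    return hosts

With that in place the dc0/dc1/dc2 lists go away and the menu selection becomes a single hosts = get_hosts(dc) call.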

Reviewed: MySQL for Python by Albert Lukaszewski

Packt Publishing recently sent me a copy of MySQL for Python to review, and after reading through the book I must say that I’m rather impressed by the variety of topics it covers.

It starts off with the basics of setting up MySQL for your testing/development needs by going over several of the common installation and configuration methods. After that comes a quick intro to connection methods and simple error reporting for connections. The author gives a quick intro to CRUD and how it relates to databases and Python before heading into the common tasks of simple queries. I was surprised to see some database profiling discussion, which is rather handy for a new coder or a person new to MySQL. Once the basics of inserts/selects/updates/deletes are covered, which is a rather quick read, there is a welcome discussion of transactions and commit methods; if you are new to MySQL and skip this section then, believe me, you’re missing a very important topic.

Most people will gloss over the basics and head right to the more advanced chapters, which cover exception handling, the all too common “MySQL server has gone away” error, date and time functions, aggregate functions, and metadata queries. These chapters were the most interesting to me, as they covered some great Python code that I have not yet played around with. Previously I’ve done a lot of work on those topics with Perl and PHP, so seeing how they were done in Python was a great treat. The code is concise, easy to read, and well explained.

A number of topics cover the time-saving solutions that no one should be without: namely, bulk data inserting, data formatting, row iteration, and CSV parsing. Logging methods for access and changes to the database are also covered, and in the end will save you a lot of time in the development cycle when you are troubleshooting app-to-db interaction.

Two chapters will be of particular interest to DBAs, and possibly less so to pure developers: Disaster Recovery and MySQL Administration. The author covers offline backups as well as online hot backups, two sections that no DBA should be without. The code for this type of work gets a decent amount of discussion and, as in the other chapters of the book, the theory and background of the topic are discussed as well, which gives the new reader an understanding of the “why” rather than leaving them with just the “how”. The administration section covers user creation and permissions management, along with a bit of background on the security involved in that task, and also goes into quite a lot of coverage of web-based GUI administration and command line interaction for admin purposes.

Overall I enjoyed the contents of the book and would recommend taking a look if you are new to Python and MySQL, or even if you are just looking for a quick reference to the common tasks of database-driven application development. This book does not cover the ORM-style database interactions you’re likely to see in an app built with Django or Pylons, but it will give you a solid foundation in how Python and MySQL interact without an abstraction layer. If you are writing quick admin code or building your own database interaction layer, this book would do well to be in your collection.

You can find the book at Amazon or directly from Packt.

Simple Python: a job queue with threading

Every so often you need to use a queue to manage operations in an application. Python makes this very simple. Python also, as I’ve written about before, makes threading very easy to work with. So in this quick program I’ll describe, via comments, how to make a simple queue where each job is processed by a thread. Integrating this code to read jobs from a MySQL database would be trivial as well; simply replace the “jobs = [...” code with a database call to a row select query (there’s a sketch of that after the script).

#!/usr/bin/env python
## DATE: 2011-01-20
## FILE: queue.py
## AUTHOR: Matt Reid
## WEBSITE: http://themattreid.com
from Queue import Queue, Empty
from threading import Thread

'''this function pulls one job off the queue and processes it'''
def processor():
    try:
        job = queue.get(False)
    except Empty:
        #more threads than jobs; nothing left to do
        print "the Queue is empty!"
        return
    print "I'm operating on job item: %s"%(job)
    queue.task_done()

'''set variables'''
queue = Queue()
threads = 4
    
'''a list of job items. you would want this to be more advanced,
like reading from a file or database'''
jobs = [ "job1", "job2", "job3" ]

'''iterate over jobs and put each into the queue in sequence'''
for job in jobs:
    print "inserting job into the queue: %s"%(job)
    queue.put(job)

'''start some threads; each one will process one job from the queue'''
for i in range(threads):
    th = Thread(target=processor)
    th.setDaemon(True)
    th.start()

'''wait until all jobs are processed before quitting'''
queue.join()
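
As promised, here’s a rough sketch of reading the job list from MySQL instead of hard-coding it. The job_queue table, its job_name column, and the connection details are hypothetical placeholders:

import MySQLdb

db = MySQLdb.connect(host="localhost", user="root",
                     passwd="password", db="test")
cursor = db.cursor()
cursor.execute("SELECT job_name FROM job_queue")
jobs = [row[0] for row in cursor.fetchall()]
cursor.close()
db.close()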

A simple load test script in Python

Lately I’ve had to do some environment load testing, so I wrote this quick script. It can be modified as needed, but the basic idea is that it spawns a number of threads (--threads=) and each thread then sends a series of requests to the URL (--url=), two per thread by default or however many you want with --per-connection=. You can have it wait a configurable time between requests as well (--wait=).

The URL is appended with a 32-character randomized string so that any database/caching layer on the backend of the site isn’t serving data from a warm cache. You can hunt down the string length of 32 and change it to whatever you want. Feel free to change and use this as needed, just keep my info at the top.

#!/usr/bin/python
################################################################################
## DATE: 2010-10-26
## AUTHOR: Matt Reid
## MAIL: mreid@kontrollsoft.com
## SITE: http://kontrollsoft.com
## LICENSE: BSD http://www.opensource.org/licenses/bsd-license.php
################################################################################

from __future__ import division
import threading
import sys
import urllib2
import select
import random
import string
import getopt
import time

class threader(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        for i in range(per):
            if wait > 0:
                time.sleep(wait)
            rand = randstr(32)
            # IMPORTANT: this is where we append the search string to the main URL
            # you might need to change this for your site.
            url = "%s/search/%s"%(u,rand)
            print "polling url: %s"%(url)
            urllib2.urlopen(url)
        
def randstr(length):
    twoletters = [c+d for c in string.letters for d in string.letters]
    r = random.random
    n = len(twoletters)
    l2 = length//2
    lst = [None] * l2
    for i in xrange(l2):
        lst[i] = twoletters[int(r() * n)]
    #for odd lengths, pad with one extra letter (outside the loop)
    if length & 1:
        lst.append(random.choice(string.letters))

    return "".join(lst)
            
def init_thread():    
    backgrounds = []
    for thread in range(threads):
        print "Spawning thread: %s"%(thread)
        background = threader()        
        background.start()
        backgrounds.append(background)
    for background in backgrounds:
        background.join()

def print_help():
    print '''loader.py - URL load test script
==================================================
Date: 2010-08-26
Website: http://themattreid.com
Author: Matt Reid
Email: themattreid@gmail.com
License: new BSD license
==================================================
Use the following flags to change default behavior
    
   Option                 Description
   --url=                 URL to test
   --per-connection=      Number of sequential requests per connection (default 2)
   --threads=             Number of threads for url connections (default 50)
   --wait=                Time to wait in-between requests
   --help                 Print this message

   -u                     Same as --url
   -p                     Same as --per-connection
   -t                     Same as --threads
   -w                     Same as --wait
   -h                     Same as --help
   '''


def main():    
    init_thread()
    sys.exit(0)

if __name__ == "__main__":
    global threads #num threads/connections to open
    global u #url to hit
    global per #per connection url hits
    try:
        options, remainder = getopt.getopt(
            sys.argv[1:], 'p:t:u:w:h', ['per-connection=',
                                        'threads=',
                                        'url=',
                                        'wait=',
                                        'help'])
    except getopt.GetoptError, err:
        print str(err) 
        sys.exit(2)
        
    for opt, arg in options:
        if opt in ('--per-connection', '-p'):
            per = int(arg)
        elif opt in ('--threads', '-t'):
            threads = int(arg)
        elif opt in ('--url', '-u'):
            u = arg
        elif opt in ('--wait', '-w'):
            wait = int(arg)
        elif opt in ('--help', '-h'):
            print_help()
            sys.exit(2)

    try:
        threads
    except NameError:
        threads = 50 #matches the default advertised in print_help()
    try:
        per
    except NameError:
        per = 2
    try:
        u
    except NameError:     
        print "No URL Specified"
        print_help()
        sys.exit(2)
    try:
        wait
    except NameError:
        wait=0
    
    main()
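
Invocation looks something like this (the URL here is a placeholder):

python loader.py --url=http://www.example.com --threads=10 --per-connection=4 --wait=1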

MySQL analytics: information_schema polling for table engine percentages

If you’ve ever needed to know how the data and index percentages per table engine are laid out on your MySQL server, but didn’t have the time to write out a query… here it is! Run it against the information_schema database (USE information_schema first, since it selects from the tables table there).

select 
(select (sum(DATA_LENGTH)+sum(INDEX_LENGTH))/(POW(1024,3)) as total_size from tables) as total_size_gb, 
(select sum(INDEX_LENGTH)/(POW(1024,3)) as index_size from tables) as total_index_gb, 
(select sum(DATA_LENGTH)/(POW(1024,3)) as data_size from tables) as total_data_gb, 

(select ((sum(INDEX_LENGTH) / ( sum(DATA_LENGTH) + sum(INDEX_LENGTH)))*100) as perc_index from tables) as perc_index,
(select ((sum(DATA_LENGTH) / ( sum(DATA_LENGTH) + sum(INDEX_LENGTH)))*100) as perc_data from tables) as perc_data,

(select ((sum(INDEX_LENGTH) / ( sum(DATA_LENGTH) + sum(INDEX_LENGTH)))*100) as perc_index from tables where ENGINE='innodb') as innodb_perc_index,
(select ((sum(DATA_LENGTH) / ( sum(DATA_LENGTH) + sum(INDEX_LENGTH)))*100) as perc_data from tables where ENGINE='innodb') as innodb_perc_data,

(select ((sum(INDEX_LENGTH) / ( sum(DATA_LENGTH) + sum(INDEX_LENGTH)))*100) as perc_index from tables where ENGINE='myisam') as myisam_perc_index,
(select ((sum(DATA_LENGTH) / ( sum(DATA_LENGTH) + sum(INDEX_LENGTH)))*100) as perc_data from tables where ENGINE='myisam') as myisam_perc_data,

(select (select sum(INDEX_LENGTH) from tables where ENGINE='innodb') / (select sum(INDEX_LENGTH) from tables))*100 as perc_total_index_innodb,
(select (select sum(DATA_LENGTH) from tables where ENGINE='innodb') / (select sum(DATA_LENGTH) from tables))*100 as perc_total_data_innodb,

(select (select sum(INDEX_LENGTH) from tables where ENGINE='myisam') / (select sum(INDEX_LENGTH) from tables))*100 as perc_total_index_myisam,
(select (select sum(DATA_LENGTH) from tables where ENGINE='myisam') / (select sum(DATA_LENGTH) from tables))*100 as perc_total_data_myisam

from tables limit 1\G

This will output something along the lines of:

          total_size_gb: 2.7505537783727
         total_index_gb: 0.018660545349121
          total_data_gb: 2.7318932330236
             perc_index: 0.6784
              perc_data: 99.3216
      innodb_perc_index: 0.6502
       innodb_perc_data: 99.3498
      myisam_perc_index: 17.5120
       myisam_perc_data: 82.4880
perc_total_index_innodb: 95.6713
 perc_total_data_innodb: 99.8607
perc_total_index_myisam: 4.3287
 perc_total_data_myisam: 0.1393
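
If you’d rather poll those numbers from a script than from the mysql client, here’s a minimal sketch using MySQLdb; the connection details are placeholders and only the first three columns of the query above are shown:

import MySQLdb

query = """select
(select (sum(DATA_LENGTH)+sum(INDEX_LENGTH))/(POW(1024,3)) from tables) as total_size_gb,
(select sum(INDEX_LENGTH)/(POW(1024,3)) from tables) as total_index_gb,
(select sum(DATA_LENGTH)/(POW(1024,3)) from tables) as total_data_gb
from tables limit 1"""

db = MySQLdb.connect(host="localhost", user="root",
                     passwd="password", db="information_schema")
cursor = db.cursor()
cursor.execute(query)
total, index, data = cursor.fetchone()
print "total: %.4f GB | index: %.4f GB | data: %.4f GB"%(total, index, data)
cursor.close()
db.close()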

Easy MySQL: transaction isolation and ACID, the simple explanation

Clients often ask what the differences are between the various InnoDB isolation levels, or what ACID means. Here are some simple explanations for those who have not yet read the manual and committed it to memory.

READ UNCOMMITTED
Every select operates without locks, so you don’t get consistency and might have dirty reads, which are potentially earlier versions of data. So, no ACID support here.

READ COMMITTED
Has consistent reads without locks. Each consistent read, even within the same transaction, sets and reads its own fresh snapshot.

REPEATABLE READ
The InnoDB default isolation level, for ACID compliance. All reads within the same transaction will be consistent with each other (the C in ACID), all writes will be durable, and so on.

SERIALIZABLE
Same as REPEATABLE READ, but InnoDB implicitly converts plain SELECT statements to SELECT ... LOCK IN SHARE MODE when autocommit is disabled; with autocommit enabled, each SELECT is its own transaction, so reads are always consistent. This is also the level you need for XA distributed transaction support: you have to be using SERIALIZABLE to use XA transactions correctly.
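
If you want to experiment with these levels from code, you can set the level on your session before opening a transaction. Here’s a minimal sketch using MySQLdb (the connection details are placeholders):

import MySQLdb

db = MySQLdb.connect(host="localhost", user="root",
                     passwd="password", db="test")
cursor = db.cursor()
#applies to subsequent transactions on this session
cursor.execute("SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE")
cursor.execute("SELECT @@tx_isolation")
print cursor.fetchone()[0] #SERIALIZABLE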

===========================================================================

ATOMICITY
Either everything in a transaction commits or nothing does. If a transaction fails partway through because of a hardware issue, connection issue, etc., the partial changes won’t be committed. It’s a 100% or 0% operation.

CONSISTENCY
Data being read by a select is all at the same state, so when you use a transaction you’re getting the most current and consistent data available. This is related to MVCC (multi-version concurrency control).

ISOLATION
Nothing that’s being read is actively being changed by another transaction. Your connection or transaction’s reads will not be changed out from under you by another transaction while you’re dealing with that data.

DURABILITY
Changes to the database persist. That means if a transaction is committed and the database or server then crashes, your changes will still be there. This is why InnoDB uses transaction log files: data is written to the logs before reaching the disk-based tables, and on the next startup the engine reads the logs and commits any remaining transactions that did not make it to the tables.
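
Atomicity in practice just comes down to commit and rollback. Here’s a minimal sketch using MySQLdb; the connection details and the accounts table are hypothetical placeholders:

import MySQLdb

db = MySQLdb.connect(host="localhost", user="root",
                     passwd="password", db="test")
cursor = db.cursor()
try:
    cursor.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cursor.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    db.commit()   #both changes become visible together...
except MySQLdb.Error:
    db.rollback() #...or neither does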

Easy Python: multi-threading MySQL queries [updated]

There are many times when writing an application that single-threaded database operations are simply too slow. In those cases it’s a matter of course that you’ll use multi-threading or forking to spawn secondary processes to handle the database actions. In this simple example of Python multi-threading you’ll see how simple it is to improve the performance of your Python app.

#!/usr/bin/env python
'''
@author: Matt Reid
@date: 2012-01-20
@purpose: example to show thread-safe MySQL connection pooling with multi-threaded SQL execution
@license: new BSD: http://goo.gl/Gndt6
@requires: Python 2.6, MySQL Server 5.0+, MySQL-python connector
@SQL table create statement:
  CREATE TABLE `test` (
  `id` int(8) NOT NULL AUTO_INCREMENT,
  `col_a` varchar(255) DEFAULT NULL,
  `col_b` varchar(255) DEFAULT NULL,
  `col_c` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
'''

import threading
import sys
import MySQLdb
import pdb

class threader(threading.Thread):
    def __init__(self,db):
        threading.Thread.__init__(self)
        self.db = db
    def run(self):
        run_insert(self.db)

def run_insert(db):
    sql = "INSERT INTO `test` (`id`,`col_a`,`col_b`,`col_c`) VALUES (NULL,'0','0','0');"
    print "thread executing sql:%s"%(sql)
    try:
        cursor = db.cursor()
        cursor.execute(sql)
        cursor.close()
        db.commit()
    except MySQLdb.Error, e:
        print "insert failed: %s"%(e)

    return

def init_thread(): 
    backgrounds = []
    for db in connections:
        print "connection: %s"%(db)
        background = threader(db)
        background.start()
        backgrounds.append(background)

    for background in backgrounds:
        background.join()

def main():
    try:
        init_thread()
    except Exception, e:
        print "failed to initiate threads: %s"%(e)
    sys.exit(0)

if __name__ == "__main__":
    mysql_host = "localhost" 
    mysql_user = "root" 
    mysql_pass = "password" 
    mysql_port = 3306
    mysql_db = "test" 
    threads = 4 #quantity of execution threads and size of connection pool

    connections = []
    for thread in range(threads):
        try:
            connections.append(MySQLdb.connect(host=mysql_host, user=mysql_user, passwd=mysql_pass, db=mysql_db, port=mysql_port))
            #pdb.set_trace()

        except MySQLdb.Error, e:
            print "Error %d: %s"%(e.args[0], e.args[1])
            pdb.set_trace()            
            sys.exit (1)

    main()
    

Output:

$> python ./test.py 
connection: <_mysql.connection open to 'localhost' at 2389850>
thread executing sql:INSERT INTO `test` (`id`,`col_a`,`col_b`,`col_c`) VALUES (NULL,'0','0','0');
connection: <_mysql.connection open to 'localhost' at 240da70>
thread executing sql:INSERT INTO `test` (`id`,`col_a`,`col_b`,`col_c`) VALUES (NULL,'0','0','0');
connection: <_mysql.connection open to 'localhost' at 24c6d70>
thread executing sql:INSERT INTO `test` (`id`,`col_a`,`col_b`,`col_c`) VALUES (NULL,'0','0','0');
connection: <_mysql.connection open to 'localhost' at 23c6290>
thread executing sql:INSERT INTO `test` (`id`,`col_a`,`col_b`,`col_c`) VALUES (NULL,'0','0','0');

* this script has been updated

Easy Python: display LVM details in XML

If you need to work with LVM in your scripts but haven’t found a good method to access details about logical volumes, here’s a simple Python script that will print the details about any volumes on your system as XML. This could be useful for writing a partition check script for your MySQL data directory (if you’re not using a standard monitoring system like Nagios).

import os
import subprocess

def lvm():
    print "<volumes>"
    LVM_PATH = "/sbin"
    LVM_BIN = os.path.join(LVM_PATH, 'lvm')
    argv = [LVM_BIN, "lvs",
            "--nosuffix", "--noheadings",
            "--units", "b",
            "--separator", ";",
            "-o", "lv_name,vg_name,lv_size"]

    process = subprocess.Popen(argv, stdout=subprocess.PIPE)
    output = process.communicate()[0]
    lines = output.splitlines()
    for line in lines:
        line = line.strip()
        words = line.split(";")

        lvname = words[0].strip()
        vgname = words[1].strip()
        lv_size = int(words[2])
        #the XML element names here are illustrative; adjust as needed
        print '''  <volume>
    <name>%s</name>
    <group>%s</group>
    <size>%s</size>
  </volume>'''%(lvname, vgname, lv_size)

    print "</volumes>"

lvm()
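
Since the script emits XML, consuming it from another script is straightforward. Here’s a minimal consumer sketch with xml.dom.minidom, assuming the output above was saved to a file named lvm.xml (the element names match the print statements in the script):

from xml.dom.minidom import parse

doc = parse("lvm.xml")
for vol in doc.getElementsByTagName("volume"):
    name = vol.getElementsByTagName("name")[0].firstChild.data.strip()
    size = int(vol.getElementsByTagName("size")[0].firstChild.data.strip())
    print "volume %s is %d bytes"%(name, size)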

Easy MySQL: how to backup databases to a remote machine

Here’s a simple answer to a simple question. “How do I run a backup of MySQL to another machine without writing to the local server’s filesystem?” – this is especially useful if you are running out of space on the local server and cannot write a temporary file to the filesystem during backups.

Method one – this writes a remote file.
mysqldump [options] [db_name|--all-databases]| gzip -c | ssh user@host.com "cat > /path/to/new/file.sql.gz"

Method two – this writes directly into a remote MySQL server
mysqldump [options] [db_name|--all-databases] | mysql --host=[remote host] --user=root --password=[pass] [db_name]
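
If you want to script method one (for a cron job, say), here’s a minimal sketch using Python’s subprocess module; the user, host, and file path are placeholders. Note that a shell pipeline only reports the exit code of its last command, so this catches ssh failures but not a failed mysqldump:

import subprocess

cmd = ("mysqldump --all-databases | gzip -c | "
       "ssh user@host.com 'cat > /path/to/new/file.sql.gz'")
ret = subprocess.call(cmd, shell=True)
if ret != 0:
    raise SystemExit("backup pipeline failed with exit code %d"%(ret))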