Manipulating Excel files using Python part 1: Reading Excel Files

It is often the case that the freely available data online are in Excel format. If one has Excel, then one has the ability to do some sort of basic manipulation of the files. But if Excel is not available, or your analysis software does not read Excel files, there is another way: use Python to manipulate Excel files.

excelPreviewBefore continuing, let’s discuss the basic idea of Excel. This programs stores data in what is called aWorkbook. Think of this as the file on your computer (the .xls). A Workbook can then have one or more Sheets, the little tabs on the bottom left corner usually. Within each Sheet, you can find an arrangement of Cells in a matrix form, which are referenced by Column Name and Row Number (so the top left cell is A1, the one directly to its right is A2, the one below is B2, etc. The image should help shed some light if you are not familiar with the concept of spreadsheets.

In order to read Excel sheets in Python without using any Microsoft Office code, we need to use the excellentxlrd packge by Lingfo. The basic workflow is as follows:

  1. Open a workbook using book = xlrd.open_workbook(“filetoread.xls”)
  2. Access a sheet in the workbook using sheet = book.sheet_by_index(0) to open the first sheet, or if you know the name, sheet = book.sheet_by_name(“Name of Sheet”)
  3. Retrieve data using any of the multiple methods: cellValue =sheet.cell_value(rowx, colx), or to retrieve a whole column use colValues = sheet.col_values(colx, start_rowx=0, end_rowx=None) or the retrieve a whole row use rowValues = sheet.row_values(rowx, start_colx=0, end_colx=None)

The basic idea is very simple, and we can easily write a little Python script that will allow us to read Excel files without the need to use Excel itself. The following example will offer a preview of an Excel file’s first sheet using the first 5 rows and columns of a file.

1
2
3
4
5
6
7
8
9
10
import xlrd # Import the package
book = xlrd.open_workbook("sample.xls") # Open an .xls file
sheet = book.sheet_by_index(0) # Get the first sheet
for counter in range(5): # Loop for five times
# grab the current row
rowValues = sheet.row_values(counter,start_col=0, end_colx=4)
# Print the values of the row formatted to 10 characters wide
print "%-10s | %-10s | %-10s | %-10s | %-10s" % tuple(rowValues)
# Print row separator
print "-" *62

That is all there is to it. Quick, fast and no need for Excel whatsoever. Sample output is presented below:

sampleOutput