First, why am I doing this, and why would you want to? I’m working with about 40 documents built off standard templates. Each is 20-30 pages long and contains around 300 data points about different buildings. Currently, this data lives in these documents, which are updated annually. I would like this data to live in a database, and to automate the annual update by pushing the data from the database to merge fields in MS Word, using Python as an intermediary between SQL and MS Word (I’ll discuss this at a later point).
For the data to end up in a normalized format in a database, I need a way to strip it out of the existing Word documents, and I’d obviously prefer not to do this manually. So I went in search of a way to parse docx files using Python, and found python-docx.
I quickly ran into a snag: most of the information online about getting data out of docx files focuses on working with paragraphs. My documents are almost entirely based around tables, not paragraphs. Working with tables is somewhat more challenging than working with paragraphs, but it does end up having advantages for what I need to do.
To access paragraphs, you refer to document.paragraphs, which returns a list of the paragraph objects in the document. Each paragraph object has its own attributes, one of which is text, so document.paragraphs[0].text will return the text of the first paragraph.
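As a minimal sketch of the paragraph access described above: the helper names here are my own, not part of python-docx, and the import sits inside the loader only so the other helpers can be tried without the library installed.

```python
def load_document(path):
    """Open a .docx file with python-docx."""
    from docx import Document  # provided by the python-docx package
    return Document(path)


def first_paragraph_text(document):
    """Text of the first paragraph.

    document.paragraphs is a list, so it needs an index before .text.
    """
    return document.paragraphs[0].text


def paragraph_texts(document):
    """Text of every paragraph, in document order."""
    return [paragraph.text for paragraph in document.paragraphs]
```

Usage would look something like doc = load_document("building_report.docx") followed by first_paragraph_text(doc), where the file name is just a placeholder.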
Tables are similar, but with more layers. A table has no text attribute; instead it has rows and columns, among other things. Each row or column has a cells attribute, and each cell has a text attribute, which is how you access the text of your table. To get the text of the first cell in the first row of the first table, you have to go to document.tables[0].rows[0].cells[0].text. The same can be done using …columns… in place of …rows…. Using this you can get data out of tables into a data structure that makes sense for what you’re doing.
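Here is a rough sketch of that table > rows > cells chain. The function names are hypothetical, and each function expects a document or table object of the kind python-docx returns from docx.Document(path).

```python
def first_cell_text(document):
    """Text of the first cell in the first row of the first table."""
    return document.tables[0].rows[0].cells[0].text


def table_to_rows(table):
    """Flatten one table into a list of rows, each a list of cell text."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def column_texts(table, column_index):
    """Same idea, but walking down a single column instead of across a row."""
    return [cell.text for cell in table.columns[column_index].cells]
```

A list of lists like the one table_to_rows returns is only one option; the point is that once you have the cell text, you can shape it however suits the next step.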
The solution I’m working with right now is to use tables > rows > cells as a three-point coordinate structure (cells are indexed left to right when accessed via rows). I constructed a series of lists, each containing the coordinates of a piece of data I’m looking for, then wrapped those lists into a list of lists, data_wrapper.
By iterating through data_wrapper and swapping out the coordinates, I’m able to access all the data I need. For the purpose of this demonstration we’re just printing the text of each cell. As I take further steps in this project, I’ll likely put the data into a dictionary, keyed to match how it’ll be stored in SQL and then accessed by Python when doing the mail merge down the line.
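A sketch of what that coordinate lookup might look like. The variable names and index values below are made up for illustration, not the real mapping for my templates; each inner list is assumed to be [table index, row index, cell index].

```python
# Hypothetical coordinates: [table index, row index, cell index].
building_name = [0, 0, 1]
year_built = [0, 1, 1]
floor_area = [1, 2, 3]

# The list of lists that gets iterated over.
data_wrapper = [building_name, year_built, floor_area]


def extract_cells(document, coordinates):
    """Return the text of each cell addressed by a (table, row, cell) triple."""
    values = []
    for table_idx, row_idx, cell_idx in coordinates:
        cell = document.tables[table_idx].rows[row_idx].cells[cell_idx]
        values.append(cell.text)
    return values


def print_cells(document, coordinates):
    """Print the text of each addressed cell, one per line."""
    for value in extract_cells(document, coordinates):
        print(value)
```

Moving to the dictionary version later should mostly be a matter of pairing each coordinate with a key, for example with dict(zip(keys, extract_cells(doc, data_wrapper))), where keys is a list of column names chosen to match the database.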
Since my documents are all built off one standard template, I should (emphasis on should) be able to do the coordinate mapping manually once and then use it across the entire document library to pull in all the data in one fell swoop.
My next task is parsing content control fields, which hold a lot of the data stored in these documents. python-docx doesn’t seem to read these fields as text, so I’ll have to tinker a bit to figure out how to make everything work.
Some resources that got me here: