How to Import XML Data - Data Wrangling with Python

Because we have an understanding of the data, let’s import the file into a usable form for Python. To get the data out of XML form and into Python, we will write the following code:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

data = root.find('Data')

all_data = []

for observation in data:
    record = {}
    for item in observation:

        lookup_key = item.attrib.keys()[0]

        if lookup_key == 'Numeric':
            rec_key = 'NUMERIC'
            rec_value = item.attrib['Numeric']
        else:
            rec_key = item.attrib[lookup_key]
            rec_value = item.attrib['Code']

        record[rec_key] = rec_value
    all_data.append(record)

print all_data

As you can see, this is a little more complicated than the CSV and JSON examples.
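The listing above uses Python 2 syntax (the print statement and indexing directly into keys()). As a rough sketch only, assuming Python 3 and a tiny inline sample in place of data-text.xml, the same logic might be written as follows. SAMPLE_XML is hypothetical and modeled on the attribute output shown later in this section; it is not the WHO file.

```python
from xml.etree import ElementTree as ET

# Hypothetical one-record sample standing in for data-text.xml.
SAMPLE_XML = """
<GHO>
  <Data>
    <Observation>
      <Dim Category="COUNTRY" Code="ZWE"/>
      <Dim Category="YEAR" Code="2012"/>
      <Value Numeric="49.00000"><Display>49</Display></Value>
    </Observation>
  </Data>
</GHO>
"""

# For a real file you would use: root = ET.parse('data-text.xml').getroot()
root = ET.fromstring(SAMPLE_XML)
data = root.find('Data')

all_data = []
for observation in data:
    record = {}
    for item in observation:
        # In Python 3, keys() returns a view that cannot be indexed,
        # so it is converted to a list first.
        lookup_key = list(item.attrib.keys())[0]
        if lookup_key == 'Numeric':
            rec_key = 'NUMERIC'
            rec_value = item.attrib['Numeric']
        else:
            rec_key = item.attrib[lookup_key]
            rec_value = item.attrib['Code']
        record[rec_key] = rec_value
    all_data.append(record)

print(all_data)
```

The only behavioral differences from the book’s listing are the print function, the list() wrapper around keys(), and the inline sample.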

Let’s take a closer look. Create a new file in your code editor and save it to the folder where you have been putting your code. Name it import_xml_data.py. Also, if you downloaded the data directly from the WHO site and not the book’s repository, rename the saved XML file data-text.xml and put it in the same folder as your new code.

First, let’s import ElementTree, part of the built-in library we are going to use to parse the XML:

from xml.etree import ElementTree as ET

XML Data | 57

2. When Python prints an object with a long numeric and hex string, that string is Python’s way of showing memory address information. It’s not necessary for our data wrangling needs, so please ignore it if your memory addresses don’t look like ours.

As mentioned earlier, there are often multiple solutions to a problem. While we use ElementTree in this example, there is another library you could use called lxml, and yet another called minidom. Because all three of these solutions can be used to solve the same problem, if you find a good example using one of the libraries, we encourage you to explore the data using another library as well. As you learn Python, choose libraries that seem easiest for you to understand (it just so happens that in many cases these are the best choices).

This import statement has an extra component we did not have last time: as ET. We are importing ElementTree, but will refer to it as ET. Why? Because we are lazy and do not want to type out ElementTree every time we want to use the library. This is a common practice when importing modules, classes, or functions with long names, but it is not mandatory. The as tells Python we want to use ET to represent ElementTree.
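To make the aliasing concrete, here is a small sketch (Python 3 assumed) showing that the short name and the full dotted name refer to the same module. The variable names short and full are only for illustration.

```python
import xml.etree.ElementTree
from xml.etree import ElementTree as ET

# The alias and the full dotted name point at the very same module object.
print(ET is xml.etree.ElementTree)  # True

# So these two calls are interchangeable; only the amount of typing differs.
short = ET.fromstring('<GHO/>')
full = xml.etree.ElementTree.fromstring('<GHO/>')
print(short.tag, full.tag)
```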

Next, we call parse, a function in the ElementTree module we imported as ET, which will parse data from the filename we pass. Because we are parsing a file located in the same folder, the filename needs no file path:

tree = ET.parse('data-text.xml')

Calling parse returns a Python object, which people normally store in a variable named tree. When talking about XML, the tree is the whole XML object stored in a way Python can understand and parse.

To understand how to traverse the tree (and data contained therein), we begin at the root of the tree. The root is the first XML tag. To start at the root of the tree, we call the tree’s getroot method:

root = tree.getroot()

If you were to take a moment to print out root by adding print root after the previous statement, you would find it prints out the Python representation of the root element in the XML tree (it should look like <Element 'GHO' at 0x1079e79d0>²). From this representation, we can quickly tell ElementTree identified the root or outermost tag of the XML document as an XML node with a tag name GHO.
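If you do not have the data file at hand, the parse and getroot steps can still be tried out. The following is a minimal sketch, assuming Python 3; the in-memory xml_file is a hypothetical stand-in for data-text.xml, relying on the fact that parse accepts a file object as well as a filename.

```python
import io
from xml.etree import ElementTree as ET

# Hypothetical stand-in for data-text.xml held in memory.
xml_file = io.StringIO('<GHO><Data/></GHO>')

tree = ET.parse(xml_file)   # the whole document
root = tree.getroot()       # the outermost element

print(type(tree).__name__)  # ElementTree
print(root.tag)             # GHO
print(root)                 # <Element 'GHO' at 0x...>; the address varies per run
```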

Now that we’ve identified the root, we need to figure out how to access the data we want. After analyzing the data in this chapter’s CSV and JSON sections, we know what data we’d like to review. We need to traverse the XML tree and extract that same data. In order to understand what we’re looking for, we need to understand the overall structure and format of the XML tree. Here, we’ve slimmed down the XML file we are using and removed the data, so we can view just the core structure:

<GHO>
    <Data>
        <Observation>
            <Dim />
            <Value>
                <Display>
                </Display>
            </Value>
        </Observation>
        <Observation>
            <Dim />
            <Value>
                <Display>
                </Display>
            </Value>
        </Observation>
    </Data>
</GHO>

In reviewing this structure, we can see each “row” of data is held in an Observation tag. The data for each of these rows is then held in the Dim, Value, and Display nodes within each Observation node.

So far we have three lines of code. To investigate how we can use Python to extract these nodes, let’s add print dir(root) to the end of our current code, then save the file and run it on the command line:

python import_xml_data.py

You will see all the methods and properties of the root variable. Your code should look like this:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

print dir(root)


When you run the file, you should see this output:

['__class__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__',
 '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_children', 'append', 'attrib', 'clear', 'copy', 'extend', 'find', 'findall', 'findtext', 'get',
 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']

Let’s assume our file is too large for us to open, and that we don’t know the structure of the file. This is often the case with larger XML datasets. What can we do? Let’s start by calling dir(root) to review the root object’s methods. We notice the getchildren method as a potential solution for seeing children in the Observation nodes. After reading the latest documentation and a question on Stack Overflow, we find the getchildren method will return the subelements, but the method has been deprecated. When a method you want to use is or will soon become deprecated, you should attempt to use what the authors of the library suggest as a replacement.

When a method, class, or function is deprecated, it means this functionality will likely be removed from future releases of the library or module. For this reason, you should always avoid using deprecated methods or classes and make sure you read through the documentation, as the authors have likely recommended what to use going forward.

Based on what the documentation is recommending, if we want to view subelements for the root tree, we should use list(root). If we have a very large file, returning direct subelements will give us a view of the data and its structure, without overwhelming us with too many lines of output. Let’s try that.

Replace this line:

print dir(root)

with this line:

print list(root)

Run the file again from your command line. You should end up with the following output, which is a list of Element objects (for our purposes, elements are XML nodes):

[<Element 'QueryParameter' at 0x101bfd290>,
 <Element 'QueryParameter' at 0x101bfd350>,
 <Element 'QueryParameter' at 0x101bfd410>,
 <Element 'QueryParameter' at 0x101bfd590>,
 <Element 'QueryParameter' at 0x101bfd610>,
 <Element 'QueryParameter' at 0x101bfd650>,
 <Element 'Copyright' at 0x101bfd690>,
 <Element 'Disclaimer' at 0x101bfd710>,
 <Element 'Metadata' at 0x101bfd790>,
 <Element 'Data' at 0x102540250>]

The list contains Element objects called QueryParameter, Copyright, Disclaimer, Metadata, and Data. We can traverse these elements and explore the contents so we better understand how to extract the data we are after.
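The memory addresses in that output are noise; the tag names are what matter. As a sketch (Python 3 assumed, with a hypothetical miniature of the file’s top level), list(root) can be combined with each element’s tag attribute to get a compact view. Note that getchildren is no longer available at all in Python 3.9 and later, so list(root) is the portable choice.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature of the file's top-level layout.
root = ET.fromstring(
    '<GHO>'
    '<QueryParameter/><Copyright/><Disclaimer/><Metadata/>'
    '<Data><Observation/><Observation/></Data>'
    '</GHO>'
)

children = list(root)                     # direct subelements only
tags = [child.tag for child in children]  # just the tag names
print(tags)
print(len(root))                          # number of direct subelements
```

Grandchildren such as Observation do not appear, because list(root) only returns direct subelements.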

When searching for data stored in an XML tree, the Data element is likely a good place to start. Now we have found the Data element, and we can focus on that subelement. There are a couple of ways to get the Data element, but we will use the find method. The root element’s find method allows us to search for a subelement using the tag name. Then, we will get the Element’s children and see what we should do next.

Replace this line:

print list(root)

with this:

data = root.find('Data')

print list(data)

There is also a findall method we could use. The difference between find and findall is that find will return the first matching element, while findall will return all of the matching elements. We know there is only one Data element, so we can use find instead of findall. If there were more than one Data element, we would want to use the findall method to get the whole list of matching elements and iterate over them.
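The difference is easy to see on a small document. This is a sketch only, assuming Python 3; the Name attributes and the NoSuchTag lookup are made up for illustration and do not come from the WHO file.

```python
from xml.etree import ElementTree as ET

root = ET.fromstring(
    '<GHO>'
    '<QueryParameter Name="a"/><QueryParameter Name="b"/>'
    '<Data/>'
    '</GHO>'
)

first = root.find('QueryParameter')     # a single Element: the first match
every = root.findall('QueryParameter')  # a list of all matching Elements
missing = root.find('NoSuchTag')        # None when nothing matches

print(first.get('Name'))
print([q.get('Name') for q in every])
print(missing)
```

Because find returns None for a missing tag, it is worth checking the result before looping over it.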

When you rerun your file with the new lines of code, you will see a dizzying output of a list of Observation elements. These are our data points. Although there is a lot of information output, you can tell it is a list because the last character is a ], which symbolizes the end of a list.

Let’s iterate over the list of data. Each Observation represents a row of data, so they should have more information contained inside. We can iterate over those elements individually to see what subelements exist. With Python Element objects, we can iterate over all of the subelements much as we would over a list. Therefore, we can iterate over each Observation and then over each subelement in it. This is our first time using a loop within a loop, and it should show us whether there are more subelements we can use to extract data.


Because XML stores data in nodes, subnodes, and attributes, you’ll often want to take the approach of exploring each node and subnode (or element and subelement) until you get a good idea of not only how the data is structured but also how Python sees the data.

Replace this line:

print list(data)

with these lines:

for observation in data:
    for item in observation:
        print item

and rerun the file.
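For a quieter view of the same nested loop, one option is to collect tag names instead of printing every Element. The following is a sketch under assumptions: Python 3, and a hypothetical two-observation sample in place of the real file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample with two observations.
root = ET.fromstring(
    '<GHO><Data>'
    '<Observation><Dim Category="YEAR" Code="2012"/>'
    '<Value Numeric="49.00000"><Display>49</Display></Value></Observation>'
    '<Observation><Dim Category="YEAR" Code="2012"/>'
    '<Value Numeric="49.00000"><Display>49</Display></Value></Observation>'
    '</Data></GHO>'
)
data = root.find('Data')

seen_tags = []
for observation in data:          # outer loop: one pass per row of data
    for item in observation:      # inner loop: each subelement of that row
        seen_tags.append(item.tag)

print(seen_tags)
```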

The output is a bunch of Dim and Value objects. Let’s try a couple of different ways to explore what might be contained in these elements. There are a few ways you can view data within Python’s Element object. One of the attributes of every Element node is text, which displays the text contained inside the node.

Replace this line:

print item

with this line:

print item.text

and rerun the file.

What happened? You should have gotten a lot of None values back. This is because there is no text between those elements’ tags, so item.text is None for them. Look at how <Dim /> is structured in the data samples. For example:

<Dim Category="YEAR" Code="2000"/>

In Python, item.text is only useful if your elements are structured with text in the nodes, like the following:

<Dim Category="YEAR">2000</Dim>

For the second example, item.text returns 2000.
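Both layouts can be compared side by side. This small sketch assumes Python 3 and parses the two example tags directly from strings; the variable names are illustrative.

```python
from xml.etree import ElementTree as ET

attribute_style = ET.fromstring('<Dim Category="YEAR" Code="2000"/>')
text_style = ET.fromstring('<Dim Category="YEAR">2000</Dim>')

print(attribute_style.text)  # None
print(text_style.text)       # 2000
```

Note that text always comes back as a string (or None), so '2000' would still need converting before doing arithmetic with it.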

XML data can be structured in many different ways. The information we need is located in the XML; it just wasn’t in the first place we looked. Let’s continue to explore.

Another place to look is in child elements. Child elements are subelements of a parent element. Let’s check if we have any child elements. Replace this line:

print item.text

with this line:

print list(item)

When you rerun the code with that change, the output shows that some (but not all) elements have children. That’s interesting! These are actually Value elements. Look at how these are structured in the data sample:

<Value>
    <Display>
    </Display>
</Value>

If we want to explore those child elements, we’ll need another loop similar to the loop we wrote to go through the items in each Observation.
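One possible shape for that extra loop is sketched below, assuming Python 3. The single Observation and the text inside Display are hypothetical sample values; the found list exists only so the result is easy to inspect.

```python
from xml.etree import ElementTree as ET

# Hypothetical single row of data.
observation = ET.fromstring(
    '<Observation>'
    '<Dim Category="YEAR" Code="2012"/>'
    '<Value Numeric="49.00000"><Display>49</Display></Value>'
    '</Observation>'
)

found = []
for item in observation:
    for child in item:   # runs zero times for elements without children
        found.append((item.tag, child.tag, child.text))

print(found)
```

Dim contributes nothing here because it has no children, which matches the empty lists seen in the output of print list(item).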

Every Element object in Python also has an attribute called attrib, which holds the attributes of the node as a dictionary. As we know from reviewing XML structure, if nodes don’t have values between the tags, they usually have attributes within the tags.

To see what attributes we have in our nodes, replace:

print list(item)

with this line:

print item.attrib

When we rerun the code, we see the data contained in the attributes output as a bunch of dictionaries. Rather than storing each element and the attributes in separate dictionaries, we want each full row of data in a dictionary together. Here is one record from our attrib output:

{'Category': 'PUBLISHSTATE', 'Code': 'PUBLISHED'}
{'Category': 'COUNTRY', 'Code': 'ZWE'}
{'Category': 'WORLDBANKINCOMEGROUP', 'Code': 'WB_LI'}
{'Category': 'YEAR', 'Code': '2012'}
{'Category': 'SEX', 'Code': 'BTSX'}
{'Category': 'GHO', 'Code': 'WHOSIS_000002'}
{'Category': 'REGION', 'Code': 'AFR'}
{'Numeric': '49.00000'}
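Since attrib is an ordinary dictionary, the usual dictionary operations apply to it. Here is a brief sketch, assuming Python 3, using one Dim tag taken from the output above; the 'n/a' default is an arbitrary placeholder.

```python
from xml.etree import ElementTree as ET

item = ET.fromstring('<Dim Category="COUNTRY" Code="ZWE"/>')

print(item.attrib)                 # dictionary of attribute names to values
print(item.attrib['Code'])         # dictionary-style lookup
print(item.get('Code'))            # Element.get performs the same lookup
print(item.get('Numeric', 'n/a'))  # optional default for a missing attribute
```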

Because we ended up with a dictionary for every record in our CSV example, let’s try to put this output in a similar form. The keys in our XML data dictionary will be slightly different, because the WHO does not provide the same data in the XML dataset as in the CSV dataset. We will work to get our data in the following form, but ignore the difference in key names. Ultimately, this will not affect how we use the data.

Just to remind you, here is a sample record from the CSV reader:


{
    'Indicator': 'Healthy life expectancy (HALE) at birth (years)',
    'Country': 'Zimbabwe',
    'Comments': '',
    'Display Value': '51',
    'World Bank income group': 'Low-income',
    'Numeric': '51.00000',
    'Sex': 'Female',
    'High': '',
    'Low': '',
    'Year': '2012',
    'WHO region': 'Africa',
    'PUBLISH STATES': 'Published'
}

Here is the goal for a sample record using our XML data. We aim to have it in this format by the time we are done parsing our XML tree:

{
    'COUNTRY': 'ZWE',
    'GHO': 'WHOSIS_000002',
    'Numeric': '49.00000',
    'PUBLISHSTATE': 'PUBLISHED',
    'REGION': 'AFR',
    'SEX': 'BTSX',
    'WORLDBANKINCOMEGROUP': 'WB_LI',
    'YEAR': '2012'
}

Notice the High and Low fields are missing. If they existed in our XML dataset, we would add them to the keys of our new dictionary. The Display Value is also missing. We decided not to include it, as it’s the same as the Numeric value.

Currently, your code should look like this:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

data = root.find('Data')

for observation in data:
    for item in observation:
        print item.attrib

To create the data structure, we need to first create an empty dictionary for each record. We will use this dictionary to add keys and values, then we will append each record to a list, so we have a final list of all of our records (similar to our CSV data).

Let’s add our empty dictionary and an empty list to hold our data in. Add all_data = [] on a new line above the outer for loop and record = {} as the first line in the for loop, like so:

all_data = []

for observation in data:
    record = {}
    for item in observation:
        print item.attrib

Now we need to figure out what our keys and values are for each line and add them to our record’s dictionary. Each time we look at attrib, we get a dictionary with one or more key and value combinations returned, like this:

{'Category': 'YEAR', 'Code': '2012'}

It looks like the value of the Category key (here, YEAR) should be the key for our dictionary, and the value of Code (here, 2012) should be set as the value for that key. As you’ll recall from Chapter 2, a dictionary key should be easily used for lookup (like YEAR) and dictionary value should contain the value associated with that key (like 2012). With that knowledge, the preceding line would become:

'YEAR': '2012'
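That transformation can be tried on plain dictionaries before touching the XML at all. The sketch below assumes Python 3; attrib_dicts is a hypothetical hand-typed copy of two lines from the attrib output shown earlier.

```python
# Hand-copied attribute dictionaries standing in for item.attrib values.
attrib_dicts = [
    {'Category': 'COUNTRY', 'Code': 'ZWE'},
    {'Category': 'YEAR', 'Code': '2012'},
]

record = {}
for attributes in attrib_dicts:
    # The Category value becomes the key; the Code value becomes its value.
    record[attributes['Category']] = attributes['Code']

print(record)
```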

Update print item.attrib to print item.attrib.keys() in your code, then rerun it:

    for item in observation:
        print item.attrib.keys()

It will output the keys of each attribute dictionary. We want to check the keys so we can form the keys and values of the new item dictionary. We end up with two different outputs: ['Category', 'Code'] and ['Numeric']. Let’s tackle one at a time.

Based on our initial investigation, for the elements with both Category and Code, we know we need to use the Category values as keys and the Code values as values.

To do this, add [0] to the end of item.attrib.keys():

    for item in observation:
        lookup_key = item.attrib.keys()[0]
        print lookup_key

This is called indexing. It will return the first item in the list.
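Indexing keys() directly works in Python 2, where keys() returns a list. In Python 3 it returns a view object that does not support indexing, so one of the following alternatives would be needed. This is a sketch assuming Python 3; the names first_key and also_first are illustrative.

```python
from xml.etree import ElementTree as ET

item = ET.fromstring('<Value Numeric="49.00000"/>')

# Option 1: turn the keys into a list, then index as in the text.
first_key = list(item.attrib.keys())[0]

# Option 2: take the first key from an iterator without building a list.
also_first = next(iter(item.attrib))

print(first_key, also_first)
```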
