From semi-structured to structured data

NEN has released a Python module called xml2csv on GitHub to extract structured data from standards using the ISO or NISO Standard Tag Set.

Structured data is the easiest to search, organize and analyze, because it is usually contained in rows and columns and its elements can be mapped into fixed pre-defined fields. For data to be machine-readable, that is, for a computer to be able to process it, the data must be structured. Extracting structured data is therefore an important milestone in creating machine-readable standards.

We would kindly ask you to help us with our research. It will take 3 minutes at most, and filling in the form at the bottom of this page would help us greatly. You can go to the form right away, or read through the story first and then find the form at the bottom of this page.

We have three options to create machine-readable standards:

  • Backward
    We start with structured data from the start
  • Forward
    We start with the data we have, lacking structure
  • Hybrid
    A hybrid approach

However, the content of most standards bodies nowadays consists of XML documents. XML is a markup language and a form of semi-structured data. So how can we transform this semi-structured data into structured data?

Using the xml2csv Python module, you can extract several entities with only a few lines of code and export them to a CSV file. These entities include ICS codes, dates, references, terms and more. You can even extend the XML processor class to implement your own processor.
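As a rough illustration of what going from semi-structured to structured data means, the sketch below uses only Python's standard library and not xml2csv. The element names (terms, term, name, note) and the sample content are made up for this example and are not taken from the ISO or NISO tag sets. The point is that every record ends up with the same fixed fields, even when the XML leaves an element out.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: both entries are valid XML,
# but the second one has no <note> element.
XML = """<terms>
  <term id="t1"><name>standard</name><note>normative document</note></term>
  <term id="t2"><name>tag set</name></term>
</terms>"""

# The fixed, pre-defined child fields every row must have.
CHILD_FIELDS = ("name", "note")


def to_rows(xml_text):
    """Map every <term> element onto the same fixed set of fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for term in root.iter("term"):
        row = {"id": term.get("id", "")}
        for field in CHILD_FIELDS:
            # Missing elements become empty strings, so all rows share one shape.
            row[field] = term.findtext(field, default="")
        rows.append(row)
    return rows


rows = to_rows(XML)
for row in rows:
    print(row)
```

Once every row has the same keys, writing it to a CSV file or loading it into a table is straightforward.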

Technical deep dive

Let’s dive in with an example. We will extract all title information from an XML document. This includes the language, introductory title, main title, complementary title and full title. See the links below for the input (XML file) and the output (CSV file).

XML2CSV NEN

Input: XML -> Output: CSV!

This can be achieved by writing a few lines of code. See the sample code below:

  1. # import required modules
  2. from xml2csv import TitleProcessor
  3. from csv import DictWriter
  4.
  5. # create reader to load XML file and writer to write CSV file
  6. reader = open('input.xml', 'r', encoding='utf-8')
  7. writer = DictWriter(open('output.csv', 'a', newline='', encoding='utf-8'), delimiter=',', lineterminator='\n', fieldnames=TitleProcessor.fieldnames)
  8.
  9. # create a processor and call its process method
 10. p = TitleProcessor(reader, writer)
 11. p.process()

Let’s walk through this example line by line. 

  • Lines 2-3: 
    We import the required modules: TitleProcessor for processing the data and DictWriter for writing the data to CSV.
  • Lines 6-7: 
    We open an XML file and create a DictWriter object. We pass in the delimiter, the line terminator and the fieldnames property of the TitleProcessor class as arguments. The field names essentially form the header.
  • Lines 10-11: 
    This is where the magic happens. First we create an instance of the processor class and pass the reader and writer as arguments to the constructor. In this case we use TitleProcessor, since we want to extract title information. We then call the process method of the processor object. And we’re done!
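To give a feel for what a processor with this shape might do internally, here is a minimal sketch built only on the standard library. It is not the xml2csv implementation: SketchTitleProcessor is a hypothetical stand-in that copies just the interface shown above (a fieldnames attribute, a constructor taking a reader and a writer, and a process method). The sample XML and its element names are assumptions loosely modelled on a title wrapper, so adjust them to your own documents.

```python
import csv
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical input; the element names are an assumption for this sketch.
SAMPLE = """<standard>
  <title-wrap xml:lang="en">
    <intro>Cereals</intro>
    <main>Test methods</main>
    <compl>Part 1: Rice</compl>
    <full>Cereals - Test methods - Part 1: Rice</full>
  </title-wrap>
</standard>"""


class SketchTitleProcessor:
    """Hypothetical stand-in mirroring the reader/writer/process shape."""

    fieldnames = ["lang", "intro", "main", "compl", "full"]

    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer

    def process(self):
        # Parse the whole document, then emit one CSV row per title wrapper.
        tree = ET.parse(self.reader)
        for wrap in tree.iter("title-wrap"):
            row = {"lang": wrap.get(XML_LANG, "")}
            for field in self.fieldnames[1:]:
                row[field] = wrap.findtext(field, default="")
            self.writer.writerow(row)


# In-memory streams stand in for input.xml and output.csv.
reader = io.StringIO(SAMPLE)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=SketchTitleProcessor.fieldnames,
                        lineterminator="\n")
writer.writeheader()  # this sketch writes the header row explicitly
SketchTitleProcessor(reader, writer).process()
print(out.getvalue())
```

Keeping the reader and writer outside the processor means the same class can work with files, in-memory buffers or any other stream, which is presumably why the original example passes them in as arguments.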

Survey

Our goal is to continuously improve standards so that they become easier to use every day. For this we need your feedback about ICS and this article. Please fill in the form just below this text to help us improve. It will take 3 minutes at most!

Want to see more?

For more information see the README file and the API documentation.

Want to view more?

Check out our open-source projects on GitHub.

Would you like to continue this conversation? Contact us!

NEN innovatielab 2022