Loading OECD Data With Python

One of the projects I have been juggling is building a package to import data from the OECD. The OECD helpfully provides an API which allows for custom queries into their large data sets: https://data.oecd.org/api/ (for free!). Unless you really like XML or JSON (which I do not), you want to find a wrapper for the query and download protocols. As a Python developer, the solution that worked best for me was the pandaSDMX Python library: https://pandasdmx.readthedocs.io/en/latest/

The package is highly object oriented, which means that I probably should have read the documentation. However, I was able to grab the data I wanted with a bit of experimentation.

This was my initial stab at grabbing the leading indicator data and then dumping it into a CSV (based on code from a question on Stack Exchange). (The 'df' variable is a pandas DataFrame.)

import pandasdmx

oecd = pandasdmx.Request('OECD')
data_response = oecd.data(resource_id='MEI_CLI', key='all?startTime=2018')
df = data_response.write(data_response.data.series, parse_time=False)
df.to_csv('c:\\Temp\\test_lei.txt', sep='\t')
 
The OECD data provides a couple of challenges to work with.

The first issue is that the OECD does not provide some of the metadata navigation tools provided by the other data providers that pandaSDMX connects to. The metadata navigation calls shown in the pandaSDMX documentation will not work on the OECD data (which is another reason I largely skipped over the documentation).

The next challenge with the OECD data is finding out where the data you want live. The leading indicators are found in the "Main Economic Indicators" (MEI) database, but that database is big, and the communication protocol used (SDMX-ML) is bloated. In order to keep the data transmission reasonable, you need to work with a smaller subset of the database, which has its own database identifier (such as 'MEI_CLI'). Of course, finding that database identifier takes a bit of work on the web page: you need to find the database in question, and then extract the database code.
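As a rough sketch of what pandaSDMX is doing under the hood, the data request boils down to a URL built from the dataset code, a dimension key, and a start date. The base endpoint and parameter names below are my reading of the OECD API documentation, so double-check them against the docs before relying on this:

```python
# Sketch: building an OECD SDMX-ML data query URL by hand.
# The endpoint and parameter names are based on my reading of the
# OECD API docs (https://data.oecd.org/api/); verify before use.
BASE_URL = "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData"

def build_oecd_query(dataset_code, key="all", start_time=None):
    """Return the data query URL for a dataset subset like 'MEI_CLI'."""
    url = "{0}/{1}/{2}".format(BASE_URL, dataset_code, key)
    if start_time is not None:
        url += "?startTime={0}".format(start_time)
    return url

# The query corresponding to the pandaSDMX call above:
print(build_oecd_query("MEI_CLI", start_time="2018"))
```

The point is that the dataset code ('MEI_CLI') is the only hard part; once you have it, the query itself is mechanical.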

For example, the Composite Leading Indicators page is: https://www.oecd-ilibrary.org/economics/data/main-economic-indicators/composite-leading-indicators_data-00042-en

If you then go to "Data", you get sent to: http://stats.oecd.org/viewhtml.aspx?datasetcode=MEI_CLI&lang=en, where you can read off the 'MEI_CLI' code (or get the query for data by choosing the appropriate API export option). This is somewhat painful to do if you want data scattered across a variety of data sets.

I have not tested it yet, but the OECD API documentation shows an extremely useful option: you can query for data that has been changed since a particular date. For financial data that is not revised, one can just query for data after the last data points you have locally archived. However, this is not sufficient for economic data, as back history can be revised. According to the documentation, those revised data points will also be returned. This greatly reduces the burden on both sides for an automated update of data sets.
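If this works as documented, an incremental update is just one more parameter on the query. The 'updatedAfter' parameter name below comes from the SDMX REST standard; since I have not tested the option against the OECD endpoint, treat this sketch as an assumption to verify:

```python
# Sketch: querying only for data added or revised since a given date.
# 'updatedAfter' is the SDMX REST parameter name; I have not tested
# whether the OECD endpoint honours it, so verify against their docs.
BASE_URL = "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData"

def build_update_query(dataset_code, last_update, key="all"):
    """Query for observations added/revised since last_update (ISO date)."""
    return "{0}/{1}/{2}?updatedAfter={3}".format(
        BASE_URL, dataset_code, key, last_update)

# Only fetch what has changed since the last local archive date:
print(build_update_query("MEI_CLI", "2018-06-01"))
```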

The major challenge of such work is not the raw query, rather marshaling the data so that it can be imported into a database. You need to map the series metadata to whatever retrieval mechanism is used in your database (typically some internal ticker system). The reason why you want to use a single package for interfacing with all the data sources is that it is easier to set up the mapping code, which will (hopefully!) work for all the supported data providers.

One reason I put this up was that finding this package took some time. Before I found pandaSDMX, I looked at a few other options, and all of them had defects. Since I did not want to waste my time seeing whether there were fixes for the problems I ran into, it may be that my decision to reject them was unfair. As a result, I will not mention any names with regards to the other options.

(c) Brian Romanchuk 2018
