Since the January 2014 release of the 1.1 version of AidData’s Chinese Official Finance to Africa dataset, our Tracking Underreported Financial Flows (TUFF) team has substantially refined its methods and expanded the scope of its data collection efforts. Last week, we released our 1.2 dataset and TUFF methodology updates, reflecting these changes. So, what are the five most important changes to come with these updates?
Sourcing new and different types of information to create “better triangulated” project records
The TUFF team has invested significant effort to ensure that it identifies as many information sources as possible for each project in the dataset. This has resulted in a sharp increase in the total number of sources as well as a diversification in the types of sources that underpin the project records in the dataset. In the original 1.0 version of the dataset, we relied on media reports for 89% of all sources (see Figure 1). Recognizing the importance of reducing our reliance on media reports, we amended the 1.1 version of the TUFF methodology to more systematically integrate other types of open sources. This resulted in a sharp reduction in media source reliance -- from 89% to 68%. The 1.2 version of the TUFF methodology and dataset goes even further. In the latest (1.2) version of the dataset, media reports represent only 54% of all sources.
Official government data and documentation from China, counterpart countries, and international organizations now constitute 30.5% of all citations. Peer-reviewed journal articles and other academic publications represent 8% of all citations (up from 1%).
Additionally, we have increased the average number of sources per project. In the 1.1 version of the dataset, there were 1954 project records and the average project record relied upon 2.13 information sources. The 1.2 version of the dataset contains 2647 project records and the average project relies upon 3.17 information sources.
Figure 1: TUFF Information Sourcing Over Dataset Releases
Identifying and correcting gaps in the dataset
Our team is committed not only to identifying and filling data gaps, but also to continuously improving the TUFF methodology to prevent these gaps from reappearing in future iterations of the dataset.
Our data on Chinese health aid activities is one example of a data gap our team corrected between the 1.1 and 1.2 versions of the dataset. In 2014, China’s Ministry of Health and several colleagues based at universities in China alerted us that our dataset did not contain comprehensive information about in-kind medical supply donations and recurring visits from Chinese medical teams. Shortly thereafter, we developed new data collection procedures to address the information gap, and we were able to increase the number of Chinese health projects -- from 219 in the 1.1 version of the dataset to 486 in the 1.2 version of the dataset.
Developing data quality scores to improve data transparency
In our version 1.2 dataset, we have introduced two different types of project-level, data quality metrics: a source triangulation score and a field completeness score. The source triangulation measure varies from 0 to 19, with higher values representing a project record that draws upon a diverse set of information sources. The field completeness measure varies from 0 to 9; higher values indicate that a higher percentage of the “fields” (i.e. variables such as transaction amount, flow type, and commitment year) for a given project record are complete.
The purpose of these data quality metrics is to help external users and members of the AidData team distinguish between project records that are more and less reliable and complete. Given that the TUFF methodology relies on open-source data collection, variable data quality is inevitable. The source triangulation and field completeness scores make it significantly easier for users of our dataset to identify those project records that should be used with caution. They also allow members of our team to prioritize projects that merit further investigation and validation in future rounds of data collection.
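To make the idea of a field completeness score concrete, here is a minimal sketch in Python. It is not the scoring procedure used in the 1.2 dataset: it simply assumes that the score counts how many of nine tracked fields are filled in. Only transaction amount, flow type, and commitment year are named above; the other field names, the `needs_caution` helper, and its cutoff of five fields are hypothetical.

```python
# Illustrative only: the list of nine fields is an assumption. The text above
# names transaction amount, flow type, and commitment year; the rest are
# hypothetical placeholders.
TRACKED_FIELDS = [
    "transaction_amount", "flow_type", "commitment_year",
    "recipient", "sector", "funding_agency",
    "status", "start_date", "end_date",
]


def field_completeness(record: dict) -> int:
    """Count how many tracked fields hold a non-empty value (0 to 9)."""
    return sum(1 for name in TRACKED_FIELDS if record.get(name) not in (None, ""))


def needs_caution(record: dict, min_fields: int = 5) -> bool:
    """Flag sparse records; the cutoff is a hypothetical choice."""
    return field_completeness(record) < min_fields


sparse_record = {"flow_type": "Grant", "commitment_year": 2011, "sector": "Health"}
print(field_completeness(sparse_record), needs_caution(sparse_record))
```

A record with only three of the nine fields filled in falls below the cutoff and would be flagged for follow-up.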
Ensuring consistency between Beijing’s official policies and our classifications of Chinese official financing
In the 1.2 version of the TUFF methodology, we have integrated several new quality control checks to ensure that the way we represent Chinese official financing is consistent with Beijing's official policies. For example, since China Development Bank (CDB) has a stated policy of offering non-concessional loans for commercial purposes, we now assume that no financing from the CDB meets the developmental intent criterion for official development assistance. This “code by rule” approach is a departure from past practice; coders previously did not make an “intent” determination -- needed to differentiate between official development assistance (ODA) and so-called other official flows (OOF) -- without explicit information about a project’s commercial, developmental, or representational intent.
Another important coding rule we have adopted relates to calculating the concessionality of loans from China Exim Bank. While not all loans from China Exim Bank are concessional, several new publications by the Chinese government (and conversations between AidData staff and Chinese government officials) have revealed that China Exim Bank offers its concessional loans with the following conditions: a 2-3% interest rate, up to a 5-7 year grace period, and a 15-20 year maturity. Even the most conservative estimate of the grant element calculated from these values easily surpasses the 25% concessionality threshold needed to qualify as ODA. As such, in the 1.2 version of the TUFF methodology, we have adopted a coding assumption whereby all concessional loans from the Export-Import Bank of China with development intent are coded as ODA-like.
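The grant element claim can be checked with a short calculation. The sketch below is an illustration under stated assumptions, not the TUFF team's own calculation: it assumes annual payments, interest-only payments during the grace period, equal principal installments afterwards, and a 10% discount rate (the reference rate conventionally paired with the 25% threshold). The function name `grant_element` and its parameters are our own.

```python
def grant_element(rate: float, grace_years: int, maturity_years: int,
                  discount_rate: float = 0.10) -> float:
    """Grant element (in %) of a loan under simplified, assumed terms:
    annual payments, interest only during the grace period, then equal
    principal installments until maturity."""
    face = 100.0
    installment = face / (maturity_years - grace_years)
    outstanding = face
    present_value = 0.0
    for year in range(1, maturity_years + 1):
        interest = outstanding * rate
        principal = installment if year > grace_years else 0.0
        # Discount this year's debt service back to the commitment date.
        present_value += (interest + principal) / (1 + discount_rate) ** year
        outstanding -= principal
    return 100.0 * (face - present_value) / face


# Least generous terms in the reported ranges: 3% interest, 5-year grace, 15-year maturity.
conservative = grant_element(rate=0.03, grace_years=5, maturity_years=15)
# Most generous terms: 2% interest, 7-year grace, 20-year maturity.
generous = grant_element(rate=0.02, grace_years=7, maturity_years=20)
print(f"conservative: {conservative:.1f}%  generous: {generous:.1f}%")
```

Under these assumptions, even the least generous combination of terms clears the 25% threshold by a wide margin.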
Introducing automated quality checks to improve data standardization
Given that a large number of researchers work together to generate AidData’s Chinese Official Finance to Africa dataset, there is a non-trivial probability that idiosyncratic coding decisions made by individual researchers will result in inconsistencies within project records. While AidData puts forth its best effort to train researchers to code projects in a consistent and systematic manner, human error is always a possibility. To protect against this source of bias, our team has developed a series of automated data quality checks.
By way of illustration, consider the task of assigning Chinese medical teams the correct “flow type” designation. One group of researchers might think to code Chinese medical teams as grants, while another group might decide to assign these teams to the “free-standing technical assistance” category. Our automated data quality checks override these idiosyncratic coding decisions, assigning all medical teams to the flow type category of “free-standing technical assistance”. Similarly, in light of the fact that all Confucius Institutes have an element of cultural promotion, we have introduced an automated data quality check to ensure that all such activities are (a) assigned to the “OOF-like” flow class category and (b) coded as having “representational” intent (see Appendix N in the 1.2 methodology for full details).
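The two rules described above could be expressed roughly as follows. This is a hedged sketch rather than AidData's actual checking code: the record layout, the `activity_type` field, and its values are hypothetical, and only the category labels come from the text.

```python
def apply_quality_checks(record: dict) -> dict:
    """Return a copy of the record with rule-based overrides applied.
    The field names and activity_type values are hypothetical."""
    checked = dict(record)
    if checked.get("activity_type") == "medical_team":
        # Every medical team gets the same flow type, whatever the coder chose.
        checked["flow_type"] = "Free-standing technical assistance"
    if checked.get("activity_type") == "confucius_institute":
        # Confucius Institutes are treated as cultural promotion.
        checked["flow_class"] = "OOF-like"
        checked["intent"] = "Representational"
    return checked


coded_as_grant = {"activity_type": "medical_team", "flow_type": "Grant"}
print(apply_quality_checks(coded_as_grant)["flow_type"])
```

Returning a copy rather than editing the record in place keeps the coder's original entry available for comparison, which is one way to audit how often the overrides fire.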
What's Next From The TUFF Team?
We are pleased to announce that the TUFF team will be releasing an updated version of its Chinese Official Finance dataset that covers all major regions, not only Africa, sometime in Q1 2016. Be sure to follow AidData's Twitter and Facebook accounts for further announcements regarding its release. In the meantime, our TUFF team remains committed to improving our data and methods, so if you have your own suggestions you'd like to send along, contact us at firstname.lastname@example.org.