How we constructed the world’s largest dataset on China’s overseas spending

A behind-the-scenes look at AidData’s latest methodology to track underreported financial flows from China to the developing world.

April 12, 2022
Kyra Solomon, Bradley C. Parks
A vessel arrives at the Port of Djibouti in October 2019. The port was partially owned and operated by DP World and China Merchants Holdings, until its container facility was seized by the government of Djibouti in February 2018. Photo by Davide Monteleone, all rights reserved; used by AidData with permission.

Shrouded within complex financing agreements and buried in difficult-to-retrieve online sources are the facts and figures of China’s overseas development program. Ten years ago, AidData embarked upon a journey to understand its true scale, scope, and composition. 

China does not voluntarily disclose information about its aid projects through international reporting systems, such as the OECD’s Creditor Reporting System or the International Aid Transparency Initiative. Nor does it publish detailed information about its overseas lending activities. 

Beijing’s lack of transparency inspired AidData to develop a rigorous, replicable, open-source methodology that would produce detailed financial, operational, and locational information about Chinese government-sponsored development projects around the globe. 

First developed in April 2013 to construct a dataset of official sector financial flows from China to Africa, the Tracking Underreported Financial Flows (TUFF) methodology is the foundation upon which AidData has built the world’s most comprehensive dataset of official sector financial flows from China to all major world regions. The latest (2.0) version of the dataset, released in September 2021, captures 13,427 Chinese government-financed development projects worth $843 billion that were approved between 2000 and 2017 and implemented between 2000 and 2021 across 165 countries. Beyond the sheer number of projects and topline dollar figures, the dataset provides more variables (70 for each project) and wider geographic coverage than any other available source. Banking on the Belt and Road leverages this unprecedented dataset to provide a bird’s-eye view of China’s geo-economic strategy before and after the introduction of the BRI in 2013.

Since the TUFF methodology was introduced nearly a decade ago, the Chinese Development Finance Program (CDFP) at AidData has collected extensive feedback from external users of the data and continually reviewed the internal procedures and systems that support the data collection process. Based on these inputs, we have made four rounds of far-reaching improvements, revising and extending the methodology in September 2015, January 2017, October 2017, and most recently September 2021. We have refined our sourcing procedures, added new variables, improved the coverage of existing variables, and created more detailed coding and categorization guidelines. These improvements, which are chronicled in a new book called Banking on Beijing, have created a foundation for the construction of what we believe to be the world’s most comprehensive and detailed dataset on China’s overseas development finance program. 

How have the methodology and resulting dataset been improved?

The CDFP team—ourselves (Brad and Kyra), Ammar Malik, Brooke Russell, Brook Lautenslager, Joyce Lin, Katherine Walsh, Sheng Zhang, Thai Binh-Elston, and a large group of intrepid student researchers—uses the TUFF methodology to systematically review millions of official and unofficial sources in more than a dozen languages. 

These sources provide information about projects funded by more than 300 official sector lenders and donors in China, including government ministries, state-owned banks, and state-owned enterprises. Targeted searches are conducted at every stage of the data collection process, and information is corroborated via triangulation across multiple sources. Projects are researched holistically to understand their aims and achievements, and special care is taken to ensure that missing information is minimized for key variables (such as financial commitment amounts, interest rates, grace periods, maturity lengths, sector codes, project implementation start dates, and project completion dates).

The latest version of the TUFF methodology more effectively accounts for the policies and practices of official sector grant-giving and lending institutions in China. It also relies more heavily on official sources, all of which are fully disclosed in project records. These sources include financing agreements published in government registers and gazettes; official records extracted from the aid and debt information management systems of host countries; annual reports published by Chinese state-owned banks; Chinese Embassy and Ministry of Commerce websites; reports published by parliamentary oversight institutions in host countries; and in some cases, our own direct correspondence with finance and planning ministry officials in developing countries.

Chinese Official Sector Agencies 

Capturing the who, what, when, where, and how of projects

On average, one project record in the new 2.0 dataset is based on seven unique sources, and 89% of the project records are underpinned by an official source. In total, the dataset relies on 91,356 sources. The quantity and quality of the sourcing that underpins the dataset have created new opportunities to analyze:

  • the specific terms and conditions that are included in grant, loan, debt forgiveness, and debt rescheduling agreements; 
  • the specific organizations that are involved in the financing, design, and implementation of projects; 
  • the precise dates of project commencement and completion; and 
  • the exact geographical locations where projects take place.

The latest version of the global dataset identifies 1,600+ interest rates, 1,900+ maturity lengths, and 1,200+ grace periods across 3,103 loans. No other publicly available dataset provides this level of coverage on Chinese loan pricing.

The implementation of the TUFF 2.0 methodology has also generated substantially more detail about the specific organizations that are involved in financing and implementing Chinese development projects. For each project, we now seek to provide information about five different types of organizations: 

  • the official sector institution in China that is responsible for providing funding and/or in-kind support for the project; 
  • the co-financing institutions inside and outside of China that are supporting the same project; 
  • the recipient/borrower institutions that are responsible for managing incoming funds and in-kind transfers;
  • the contractors and subcontractors that are responsible for project implementation; and 
  • the third parties that provide repayment guarantees, credit insurance policies, and collateral which can be seized in the event of default. 

The dataset includes 334 official sector funding institutions in China, 460 co-financing institutions, 2,450 recipient institutions, 3,523 implementing institutions, and 227 third parties (or “accountable agencies”) that provide repayment guarantees, credit insurance policies, and sources of collateral. This level of detail enables analysts to study the inner workings of Beijing’s overseas development program and the organizational networks that have enabled China to become an international financier of first resort for low-income and middle-income countries.      

Granular information about when and where Chinese development projects have taken place is also available: 5,539 projects (worth $438 billion) have precise implementation dates and 6,061 projects (worth $333 billion) have precise completion dates. We also record the originally scheduled project implementation start and completion dates, enabling analysts to determine if projects have been implemented on, behind, or ahead of schedule. 
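As a rough illustration of the schedule comparison described above, the sketch below classifies a project by comparing an originally scheduled date with the date actually recorded. The field names (`planned_completion`, `actual_completion`) and the three sample records are hypothetical and made up for this example; they are not the dataset's actual column names or values.

```python
from datetime import date
from typing import Optional


def schedule_status(planned: Optional[date], actual: Optional[date]) -> str:
    """Classify an actual milestone date against its originally scheduled date."""
    if planned is None or actual is None:
        # One of the two dates is missing, so no comparison is possible.
        return "unknown"
    if actual < planned:
        return "ahead of schedule"
    if actual > planned:
        return "behind schedule"
    return "on schedule"


def slippage_days(planned: date, actual: date) -> int:
    """Positive values mean the milestone happened later than planned."""
    return (actual - planned).days


# Made-up records for illustration only (not taken from the dataset).
records = [
    {"id": "A", "planned_completion": date(2015, 6, 30), "actual_completion": date(2016, 1, 15)},
    {"id": "B", "planned_completion": date(2012, 12, 31), "actual_completion": date(2012, 11, 1)},
    {"id": "C", "planned_completion": date(2018, 3, 1), "actual_completion": None},
]

statuses = {
    r["id"]: schedule_status(r["planned_completion"], r["actual_completion"])
    for r in records
}
print(statuses)
```

The same comparison could be applied to implementation start dates; records with a missing date are reported as "unknown" rather than guessed.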

What’s more, for the 3,285 projects that have physical footprints or took place in specific locations, the new dataset extracts point, polygon, and line vector data via OpenStreetMap URLs and provides a corresponding set of GeoJSON files. These files, which detail the precise geographic boundaries of projects, can be downloaded from our GitHub repository.
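For readers who have not worked with GeoJSON before, here is a minimal sketch of how such a file might be inspected using only Python's standard library. The embedded sample is a made-up FeatureCollection with one point, one line, and one polygon; its names and coordinates are hypothetical, and the layout of the actual files in the repository may differ.

```python
import json
from collections import Counter

# Hypothetical sample; real project files may carry different properties.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"name": "hypothetical site"},
     "geometry": {"type": "Point", "coordinates": [43.1, 11.6]}},
    {"type": "Feature", "properties": {"name": "hypothetical road"},
     "geometry": {"type": "LineString",
                  "coordinates": [[32.4, 0.0], [32.5, 0.1]]}},
    {"type": "Feature", "properties": {"name": "hypothetical compound"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[32.4, 0.0], [32.5, 0.0],
                                   [32.5, 0.1], [32.4, 0.0]]]}}
  ]
}
"""


def geometry_types(geojson_text: str) -> Counter:
    """Count geometry types (Point, LineString, Polygon, ...) in a GeoJSON document."""
    data = json.loads(geojson_text)
    if data.get("type") == "FeatureCollection":
        features = data.get("features", [])
    elif data.get("type") == "Feature":
        features = [data]
    else:
        # A bare geometry object: wrap it so it can be handled uniformly.
        features = [{"geometry": data}]
    # Features with a null geometry are skipped.
    return Counter(f["geometry"]["type"] for f in features if f.get("geometry"))


counts = geometry_types(SAMPLE_GEOJSON)
print(dict(counts))
```

To inspect a downloaded file instead of the inline sample, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.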

Another highlight of the new dataset is that it allows users to drill down and understand how Chinese development projects are financed, designed, and implemented in practice. 

Take, for example, collateralization: the dataset shows that nearly half (44%) of official sector lending from China is collateralized. The individual records in the dataset allow users to drill down further and identify the specific source or sources of collateral that a lender can seize if a borrower defaults on its repayment obligations. These types of granular data make it easier to subject popular claims to empirical scrutiny, such as the idea that Chinese state-owned lenders ask for physical, illiquid assets like ports and electricity grids as sources of collateral that they can seize in the event of default.

AidData’s recent analysis of a $200 million China Eximbank loan for the Entebbe International Airport Upgrading and Expansion Project is a case in point. Given that the TUFF methodology involves the retrieval of the original loan contracts signed by Chinese lenders and borrowing institutions, we were able to demonstrate with contractual evidence that Entebbe International Airport—a physical, illiquid asset—is not itself a source of collateral that the lender can seize in the event of default. Instead, China Eximbank required its borrower, the Government of Uganda, to provide a fully liquid source of collateral: a cash deposit in an escrow account that the lender can unilaterally seize in the event that the borrower defaults on its repayment obligations.

AidData’s publication of the original, unredacted versions of the financing agreements that underpin Chinese development projects has also opened up new opportunities to analyze subsidiary on-lending arrangements, the use of special purpose vehicles registered in offshore jurisdictions, escrow account requirements, the cost of credit insurance from Sinosure, the consequences of default and other breaches of contract, and dispute resolution mechanisms, among many other things. 

The dataset itself is also a rich source of qualitative information that can be used for further analyses. Our team has taken special care to assemble “cradle-to-grave” narratives that tell the story of each project in a coherent and consistent structure. They provide detailed information about how Chinese development projects are being implemented in practice and why they have succeeded, failed, or faltered. We have made all of these narratives publicly available, and they are embedded in our main dataset. Across all 13,427 projects in the dataset, the project narratives contain as many words as one would find in 19 full-length books!

Consistent with our mission of providing rigorous and policy-relevant evidence that can make a difference in the world, we believe this dataset will open new opportunities for analysts and decision-makers to better understand the aims and impacts of China’s overseas development finance program.

Kyra Solomon was previously a Program Manager on the Chinese Development Finance Program team.

Brad Parks is the Executive Director of AidData at William & Mary. He leads a team of over 30 program evaluators, policy analysts, and media and communication professionals who work with governments and international organizations to improve the ways in which overseas investments are targeted, monitored, and evaluated. He is also a Research Professor at William & Mary’s Global Research Institute.