Lesson 2: Geospatial Project data

Description

Guidance and examples for building location (geospatial) information from administrative and project data to be used in a geospatial impact evaluation. Topics include: geospatial data basics, units of analysis, preparing project geospatial data, project metadata

Transcript

All right. So, as I mentioned, the goal of this section is really going to be to familiarize you with some of the core concepts related to preparing your project data for geospatial analysis.
To do this, I'm going to focus on a few different topics, such as defining your units of analysis, preparing or converting your raw project documentation and other sources of data into a geospatial data format, and then linking that spatial data with other project information you might have on hand.
So before we dive into those topics, I wanted to spend a few minutes briefly looking at some of the geospatial data basics that are involved. This is going to include geospatial data types, characteristics, spatial operations we might be using, and some of the tools you can use for working with spatial data in general.
Now, at its most basic level, geospatial data is really just any information tied to a location on Earth.
There are a lot of different ways to represent geospatial data, and really nearly infinite different things which you can connect to locations. Geospatial data can include satellite imagery, country boundaries, household survey locations, social media posts that were geotagged, locations of development sites, road networks, and many other things.
And while those all come from very different sources, the really great thing about geospatial data is that they're all connected through this common frame of reference, because they all have a geospatial element to them. They're all somewhere on the Earth.
That means we can merge all these different types of information together, explore relationships, and perform analysis in ways that simply aren't possible without the spatial data, this core connecting element.
So in a few minutes we're going to revisit some real examples of different types of spatial data from past years. But for now I just want to give you a sense of some realistic ways you might encounter geospatial data for development projects. You could have GPS coordinates collected in household surveys.
You could have field workers who logged field plot corners or the routes of dams that were developed. You could have worked with implementers or construction firms who provided you with paper maps that show the extent of project sites.
And, as I mentioned before, as varied as all these different sources could be, they're all ultimately just tied together by that locational information.
So the next step is understanding how we represent those locations and related information in a standardized, usable format.
So, in general, all geospatial data boils down to two core formats. The first is vector data, which includes point, line, and polygon features, as you see here. Points are just what you think of as coordinates, an X and Y location. Lines are a series of those points that are connected. This would be something like a road or a stream.
And then polygons are just a closed shape represented by points. So this is a building, an administrative unit, something like that.

So, just recapping the vector stuff I was talking about: the points, lines, and polygons here.
Some of the common file formats you might use to store this vector data, which you might have seen before, are shapefiles, GeoJSONs, or GeoPackages, all very common formats.
In these formats, each feature can also have properties associated with it to store additional information about the feature, such as an ID or associated measurement types, like you can see on the slide.
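To make the idea of vector features with properties a bit more concrete, here is a minimal sketch of a GeoJSON-style feature collection built with Python's standard `json` module. The coordinates, IDs, and property names are made up for illustration and do not come from any project discussed here.

```python
import json

# Hypothetical features: a point (household), a line (road), a polygon (field).
# GeoJSON stores each coordinate as [longitude, latitude].
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [36.82, -1.29]},
        "properties": {"id": "hh_001", "kind": "household"},
    },
    {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": [[36.80, -1.30], [36.82, -1.29], [36.85, -1.27]],
        },
        "properties": {"id": "road_001", "kind": "road"},
    },
    {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            # A polygon ring is closed: the last vertex repeats the first.
            "coordinates": [[
                [36.81, -1.31], [36.83, -1.31], [36.83, -1.30],
                [36.81, -1.30], [36.81, -1.31],
            ]],
        },
        "properties": {"id": "field_001", "kind": "field"},
    },
]

collection = {"type": "FeatureCollection", "features": features}

# Serialize to text, as it would be written to a .geojson file.
geojson_text = json.dumps(collection, indent=2)

geometry_types = [f["geometry"]["type"] for f in collection["features"]]
print(geometry_types)
```

The `properties` dictionary is where an ID or any measurement attached to the feature would live, which is what later lets you link each feature to other project data.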
The second core format of geospatial data is rasters. Rasters are essentially a spatially referenced image, where each pixel in the image is tied to a specific location and has a measurement value associated with it.
Raster data is typically associated with satellite imagery and measurement datasets, which Kunwar will explore more in the next session today. I'm including it now so that you can start to think about how the geospatial data you generate or acquire to represent project information can be used alongside other datasets in raster formats, such as nighttime lights.
Probably the most common file format you're going to see for rasters is the GeoTIFF, which simply adds a geospatial element to the common TIFF image file format.
One of the fundamental elements of raster data that I want you to keep in mind is the resolution of the data. The resolution of raster data is essentially how much physical area on the ground is represented by each individual pixel in the image. It's a critical consideration when determining whether the raster data you're looking at makes sense to use with your project data.
So, for example, if your project data consists of small, one hundred square meter farms within a broader five square kilometer region, it wouldn't make sense to evaluate land use change with a satellite-based land cover classification represented by a raster with a resolution of five kilometers.
The farms would be much smaller than the resolution of the data, so you would not see any change across the farms.
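The mismatch in this example is easy to put into numbers. The short sketch below assumes the farms are 100 square meters each and that the raster resolution is 5 kilometers per pixel side; both figures are taken from the example as I understand it, and the variable names are just for illustration.

```python
# Assumed values from the example: 100 m^2 farms, 5 km raster pixels.
farm_area_m2 = 100.0
pixel_size_m = 5_000.0

# Ground area covered by a single square pixel.
pixel_area_m2 = pixel_size_m ** 2

# How many farms would fit inside one pixel, and how small one farm is
# relative to the pixel that contains it.
farms_per_pixel = pixel_area_m2 / farm_area_m2
farm_share_of_pixel = farm_area_m2 / pixel_area_m2

print(f"One pixel covers {pixel_area_m2:,.0f} square meters")
print(f"That is the area of {farms_per_pixel:,.0f} farms")
print(f"A single farm is {farm_share_of_pixel:.4%} of one pixel")
```

Under these assumptions a single pixel value summarizes an area about 250,000 times larger than one farm, so a change on any one farm would barely move that value.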
I should also mention that sometimes, even when the resolution seems coarse to the human eye, like the example here,
spatial algorithms may still be able to pick up meaningful trends. Ultimately, this depends on the context of your analysis and how you intend to use the data.
We'll get more into that.
So how can we use vector and raster data together? At a very basic level, either format can be transformed into the other. Vector data can be disaggregated into raster data with a user-provided resolution. As you see on the left here, we take a vector feature and break it down into individual pixels representing that feature.
And raster data pixels can be aggregated to form vector features, as you can also see here.
You're not likely to dive into the specifics of these transformations, but they're the basis for an important, fundamental spatial operation
you're almost certainly going to incorporate: zonal statistics.
So zonal statistics allows you to take a vector representation of your project data, like a project site, and extract statistics from raster measurement data, such as nighttime lights, that overlaps with each vector feature for your project.
So in this example we have raster data measuring vegetation, using NDVI, and polygons defining project sites.
Using zonal statistics, we can extract the average level of vegetation from all the pixels in the raster data at each project site, and then export and explore those in a tabular rather than a spatial format, to do visualizations or analysis using Excel or any other tools you would work with for any other data.
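As a rough sketch of what zonal statistics does under the hood, the example below averages raster pixel values over the pixels each site covers, using only plain Python. The NDVI values, the site names, and the pixel lists are all invented, and it assumes the site polygons have already been converted to the raster pixels they overlap. In practice a GIS tool or library would handle that overlay step for you.

```python
# Hypothetical NDVI raster as a grid of pixel values (row 0 is the top row).
ndvi = [
    [0.2, 0.3, 0.6, 0.7],
    [0.2, 0.4, 0.6, 0.8],
    [0.1, 0.2, 0.5, 0.6],
]

# Hypothetical project sites, each given as the (row, column) pixels it covers.
site_pixels = {
    "site_A": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "site_B": [(0, 2), (0, 3), (1, 2), (1, 3)],
}


def zonal_mean(raster, pixels):
    """Average the raster values at the given (row, column) positions."""
    values = [raster[row][col] for row, col in pixels]
    return sum(values) / len(values)


# One row per site: a tabular result that could be exported to a spreadsheet.
table = [
    {"site_id": site_id, "mean_ndvi": round(zonal_mean(ndvi, pixels), 3)}
    for site_id, pixels in site_pixels.items()
]

for row in table:
    print(row)
```

The output is just a table keyed by site ID, which is the point: once the statistics are extracted, the rest of the analysis does not need to be spatial.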
So before we move on, I just want to take another brief moment to mention some of the different tools that are available for working with geospatial data. The two big desktop tools are QGIS and ArcGIS, and they're both going to support nearly any kind of visualization or geospatial operation you might want to do.
Without going too deep into the differences, QGIS is free and open source, and ArcGIS requires a license,
but both of them are really going to satisfy any type of work you need to do.
Now, if you prefer to code, or need to perform a lot of custom analysis or automate data processing, you might want to use programming languages such as Python or R. These are probably the two most well supported in terms of the communities and the packages or libraries they have available, but there's definitely support for other languages if you're familiar with them.
And then finally, later on, Kunwar is going to explore using Google Earth Engine, which is a cloud-based tool that mixes a user interface and coding in a really powerful way. There are lots of great communities and tutorials for that online as well, covering all the different kinds of spatial analysis you might want to do.
So, moving on to our first main section: one of the first things you want to do when you're considering a geospatial impact evaluation is think about your unit of analysis.
This is going to be the geospatial feature that is associated with each of your samples, and it could be a field, a house, a village, a grid cell, or the overall project area.
So a key thing here is to recognize that a spatial unit of analysis is not always as clear cut as the unit of analysis in a non-spatial study.
This is particularly true when it comes to development projects.
So you can think about many non-spatial studies where the unit of analysis
for an intervention is very well defined. It's easy to understand: when giving a pill to an individual for a disease, the individual person is the unit of observation.
But typically, with development projects, the spatial unit of analysis needs to have a more theoretical basis for why you're making that selection.
So if your project builds irrigation canals, which you have spatial data for, you would likely want your unit of analysis to be a buffer of some size around the canals, not just the canal itself, to actually capture nearby impacts.
But the specific size of that buffer depends on how far from the canal we actually expect to see or gauge the benefits.
If you make your buffer only five meters around the canal, that's not going to cover a whole plot. So do you make it one kilometer? Five kilometers?
Ultimately, the spatial definition of your unit of analysis has the potential to impact your outcomes, the choices you make about which data to include in your models,
your results, and the implications of your findings.
In many cases you may actually end up testing multiple units of analysis to look at different outcome metrics, ensure your results are robust, or explore variation in impact at varying distances from project sites. So you might look at whether those irrigation projects have more impact
in the one to five kilometer range versus the five kilometer range.
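One simple way to think about testing several buffer distances is to compute each observation's distance to the project feature and then label which distance band it falls in. The sketch below does this in plain Python for a made-up canal and made-up plot locations, assuming all coordinates are in a projected system measured in meters. The band edges of 1 km and 5 km are illustrative choices, not a recommendation.

```python
import math


def distance_to_segment(p, a, b):
    """Distance from point p to the segment a-b (projected meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_line(p, line):
    """Shortest distance from p to any segment of a polyline."""
    return min(distance_to_segment(p, a, b) for a, b in zip(line, line[1:]))


def distance_band(distance_m, edges_m=(1_000, 5_000)):
    """Label a distance by the buffer ring it falls in."""
    lower = 0
    for upper in edges_m:
        if distance_m <= upper:
            return f"{lower}-{upper} m"
        lower = upper
    return f">{lower} m"


# Hypothetical canal path and plot locations.
canal = [(0, 0), (4_000, 0), (4_000, 3_000)]
plots = {"plot_1": (2_000, 500), "plot_2": (6_000, 3_000), "plot_3": (12_000, 0)}

bands = {pid: distance_band(distance_to_line(xy, canal)) for pid, xy in plots.items()}
print(bands)
```

Changing `edges_m` is all it takes to rerun the same comparison with different buffer sizes, which is the kind of robustness check described above.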
So what I have here is not an exhaustive list of potential units of analysis, but I just wanted to share some examples of what a spatial unit of analysis might look like for different project types.
The key element here that I want you to take away is that there's not always just a single right option for what the unit of analysis should be.
For example, if you want to measure irrigation improvements at the field level, having a polygon defining the exact shape of the field would probably be ideal, but having just a point somewhere in the field can also be sufficient.
A good case where that would be enough is if your fields are fairly regular in size and you have a point roughly in the center of each one. You could fairly reliably use a buffer or a square around that point and get a good representation of the field, while doing a lot less work than actually collecting the full boundary through field work.
Ultimately, deciding whether a point is sufficient for your work, or whether you need those exact boundaries, comes down to the specific analytical approaches you're using and the other data you intend to use.
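Here is a small sketch of the "square around a point" idea, assuming projected coordinates in meters and a field that is roughly 100 meters on a side. The center coordinates and the side length are invented for illustration.

```python
def square_around_point(x, y, side_m):
    """Return a closed square ring centered on (x, y), side_m meters wide."""
    half = side_m / 2
    return [
        (x - half, y - half),
        (x + half, y - half),
        (x + half, y + half),
        (x - half, y + half),
        (x - half, y - half),  # repeat the first vertex to close the ring
    ]


def polygon_area(ring):
    """Area of a closed ring using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# Hypothetical field center point, in projected meters.
field_center = (500_000.0, 4_200_000.0)
field = square_around_point(*field_center, side_m=100.0)

print(field)
print(polygon_area(field))
```

Whether a square like this is a fair stand-in depends on how regular the real fields are, which is exactly the judgment call described above.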
So as we go through the rest of the training today and tomorrow, some of our presenters will give you a sense of how we've made those choices ourselves.
So let's explore some of the common types of units of analysis we might use. The first one is buffers around points, lines, or polygons representing project sites.
In this slide I've got four examples from actual projects in which we used buffers around
road networks, canal networks, irrigated fields, and villages.
In general, buffers can be a good choice for projects where the impacts are not really constrained to a well-defined area that you have a spatial definition for.
For example, development in a region after a new highway is built might not occur directly along the highway, similar to the canal example from before.
It might not even occur within a consistent distance from the highway.
A buffer which can theoretically be defended as capturing development due to the highway, but which is not so distant that it picks up unrelated impacts, can be a good choice for your work.
Buffers or other area-based definitions of project sites can also be broken down into individual smaller grid cells for the analysis.
Grid cells can be a good choice when the statistical analysis method requires more units of observation, or when you want to measure precise change across your project regions, not just at the region level as a whole.
At that point you might be wondering, why not just always use grid cells and get that extra layer of information? But there are some potential disadvantages to consider. In addition to having to theoretically defend why you chose that grid cell size, small grid cells can become much smaller than the resolution of the actual raster and other measurement data you're including that define your outcome measures or other critical variables in the analysis.
So even though you have that smaller unit of analysis, you're not going to be able to pick up change if the other data you're tying to those units of analysis isn't at the same resolution.
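To illustrate splitting a project area into grid cells, and the resolution check that goes with it, here is a minimal sketch in plain Python. It assumes a rectangular project extent in projected meters; the extent, the 1 km cell size, and the 5 km raster resolution are made-up values, and the function names are hypothetical.

```python
import math


def make_grid(min_x, min_y, max_x, max_y, cell_size):
    """Split a bounding box into square cells, numbered row by row."""
    n_cols = math.ceil((max_x - min_x) / cell_size)
    n_rows = math.ceil((max_y - min_y) / cell_size)
    cells = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_x + col * cell_size
            y0 = min_y + row * cell_size
            cells.append({
                "cell_id": row * n_cols + col,
                "bounds": (x0, y0, x0 + cell_size, y0 + cell_size),
            })
    return cells


def finer_than_raster(cell_size, raster_resolution):
    """True when grid cells are smaller than the raster pixels behind them."""
    return cell_size < raster_resolution


# Hypothetical 3 km x 2 km project extent, split into 1 km cells.
cells = make_grid(0, 0, 3_000, 2_000, cell_size=1_000)
print(len(cells), cells[-1])

# If the outcome raster has 5 km pixels, all of these cells sit inside
# the same pixel, so they would not show any variation between them.
if finer_than_raster(1_000, 5_000):
    print("Warning: grid cells are finer than the raster resolution")
```

A check like `finer_than_raster` is worth running for every raster dataset you plan to attach to the grid, not just the main outcome measure.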
Then, one of the best units of analysis you can have is just the existing, precisely defined intervention area.
An example might be an irrigation project where each site served exactly three fields, and you have a polygon defining every field precisely.
Unfortunately, this level of precise geospatial data is not usually recorded for development projects, and generating it from scratch can also just be costly.
Now, with broader-scope projects, it might also be more useful to use the administrative boundaries for the area as the unit of analysis. These can be a good choice if you need to join data from other surveys or census rounds which are tied to those administrative boundaries.
One of the big challenges with using administrative boundaries is when they change over time.
Even when spatial records of those changes exist in the country's administrative records, it can be difficult to actually account for how those boundary changes impact the project in your causal analysis.
Administrative boundaries can also vary in size substantially across the country, which can introduce its own problems depending on your analytical approach.
So far, I've primarily talked about project data that is already in a prepared, usable geospatial data format, just to give you examples of units of analysis.
But more often than not, the actual location information you have available from project records is not ready to immediately be turned into a unit of analysis. Many times, it's not even in any usable geospatial data format.
So in this section I'm going to skim through a number of different approaches to preparing geospatial data that we've used in previous GIEs, just to give you a sense of what is possible for working with different types of project data.
Now, I'll start off with a pretty easy example where we were provided with precise geospatial data for the project sites. In this GIE, the exact boundaries of indigenous lands were recorded to track the impact of legally recognizing ownership of these lands on deforestation.
Our unit of analysis was the exact boundaries, and the evaluation looked at deforestation rates before and after legal demarcation within each boundary.
Now, a more complex example using precise project data was an evaluation of
mine clearance activities in Afghanistan. In this evaluation we again had the precise boundaries of areas which had been cleared, but we used several different units of analysis derived from those boundaries to explore different impacts.
So the precise boundaries were used to evaluate land use and land cover changes. But then we used one kilometer grid cells within those boundaries to evaluate economic activity based on nighttime lights.
Then we used villages proximate to those boundaries to explore impacts on different metrics from household surveys.
Then we used large ten kilometer grid cells overlapping with the boundaries to evaluate conflict,
and then, finally, district-level administrative boundaries were used for a non-causal analysis of aid distribution and economic activity.
An interesting situation that's surprisingly common with projects in the development space is only having images or scans of maps, and not the actual geospatial data that was used to make the map.
This is typically going to be the result of having project contractors handle the mapping and GIS components, and then the resulting geospatial data files being misplaced or never fully turned over after the map images for a report were generated.
As a result, when we have these maps, we need to extract the features shown in the map by first georeferencing the image and then rebuilding the features.
Georeferencing is the process of adding a geospatial coordinate system to a map by identifying ground reference points. Essentially, you can load this image,
which isn't spatially referenced, alongside known satellite imagery that is spatially referenced, and link specific locations that you can see in both, such as houses or roads, in order to georeference this image.
Once the image is georeferenced, you can then trace the features that you see in the map to recreate them in a geospatial format.
This can be labor intensive, so one thing I'd always recommend is spending a little bit of extra time trying to find the original geospatial data before going this georeferencing and mapping route.
But if you need to georeference and trace maps, we do have a guide we've used for our own work in the past that we'd be happy to share.
Another common scenario, especially when doing a retrospective evaluation of much older projects, is extracting location information from text-based project records. A lot of the earlier geocoding efforts I mentioned, for country aid management systems and the World Bank, used this kind of approach.
So you might have descriptions of project locations that come from a variety of different project documentation and government, donor, or contractor systems, or even from media and other sources.
The quality and the precision of those descriptions of locations can
really be highly variable. The process of finding that location information in this documentation is called geoparsing.
Once you find that location information, it can be geocoded to translate it into a standard geospatial data format.
So depending on the quality of the location information in the documents, the exact approach to geocoding that information can vary.
If you have very limited information, it may make sense to use a simplified system where you're just dropping a point associated with the approximate location, and then you have an additional field that you use to describe the type of location
that the point is supposed to represent. So you might drop a point at the approximate location of the village. You don't know exactly where it is, and you have a field that describes that the point is supposed to be a village.
Or sometimes you might know roughly what city block a hospital was developed on, and so you drop a point on the block and you label it as a hospital.
With more descriptive records, it might be possible to generate more precise features, such as the exact path of a road project or polygons defining where that exact hospital was built.
The key element of either approach is utilizing these extra fields which describe the precision of the recorded feature relative to the actual project.
So if you know exactly where a hospital is and you trace the outline of the building, you would say: this is a hospital, and I actually traced the hospital.
But if you know the project is a hospital, and you only know what city it's in, you might say: this project represents a hospital, but I was only able to geocode the city because I don't know exactly where it is.
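A small sketch may help show how those extra fields can be stored and used. The records below keep the type of feature the project represents separately from how precisely it was actually geocoded. The field names, the precision categories, and the rule for which precision levels suit which scale of analysis are all hypothetical, not an established coding scheme.

```python
# Hypothetical geocoded records: what the project is, and how precisely
# its location was actually recorded.
records = [
    {"project_id": "P001", "feature_type": "hospital",
     "precision": "exact", "geometry_type": "Polygon"},
    {"project_id": "P002", "feature_type": "hospital",
     "precision": "city", "geometry_type": "Point"},
    {"project_id": "P003", "feature_type": "village",
     "precision": "approximate", "geometry_type": "Point"},
]

# Hypothetical rule: which precision levels are acceptable at each scale.
ACCEPTABLE_PRECISION = {
    "local": {"exact"},
    "regional": {"exact", "approximate", "city"},
}


def usable_for(records, scale):
    """Return the project IDs precise enough for the given analysis scale."""
    allowed = ACCEPTABLE_PRECISION[scale]
    return [r["project_id"] for r in records if r["precision"] in allowed]


print("local:", usable_for(records, "local"))
print("regional:", usable_for(records, "regional"))
```

Keeping precision as its own field means the same dataset can serve both a local and a regional analysis, with the filter making the trade-off explicit.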
So that's really important once you get to the analytical stages, for
leveraging that precision to inform your analysis.
If you don't have precise enough
geocoded data to perform the types of analysis you want to do, that's one of those points where you have to decide: can I actually do what I want to do with this data quality?
Generally, the approximate point-based approach is going to be used for larger collections of project data without great location information.
Evaluations looking at impacts over regional scales won't really be as negatively impacted by the lower precision of the data.
So that goes back to your unit of analysis. If you have this less precise geocoded data, you might need to use a
coarser unit of analysis in order to effectively incorporate it all into your study.
Generating the precise features obviously requires much better information and takes more time, but it also allows you to conduct much more precise subnational evaluations looking at local impacts.
Here, I included just a couple of images to illustrate the difference between using these point-based and precise approaches. So in this slide you can see points representing individual villages that a road project went through. All we had was a text description saying
the road was built through these eight cities or villages.
You can imagine that, only having those points, you're not able to do a lot of really precise spatial analysis. But if we are able to find the exact path of that road through imagery, you can do a lot more to understand the local impacts.
A more recent approach to geocoding is utilizing existing features from open source platforms like OpenStreetMap, instead of creating these features yourself using human labor and effort to trace roads and other public features.
OpenStreetMap is just a public, user-generated database of spatial features describing essentially everything in the world: buildings, roads, administrative zones, parks, bus stops.
Really, anything else you'd go out in the city and see is probably recorded here.
Coverage does vary around the world based on user activity, but these days it's pretty good in most areas.
We've leveraged OpenStreetMap data to geocode AidData's global Chinese development finance data.
The main caveat with using this approach is that it works best for projects that describe something with a physical footprint, like a road or a building. Something that is more like a
community health program, that doesn't have a specific area and is sort of widespread, is hard to use OpenStreetMap for.
That said, there is some flexibility with OpenStreetMap for less specific features, and you can contribute your own features when they don't exist.
The last approach to generating geospatial features from project data is actually going out and collecting the geospatial data.
There are a lot of different ways of doing this, but it essentially involves sending people into the field to record GPS locations or walk paths.
The image here is the result of recent field work testing a few of these different approaches to mapping farm plots.
The less precise one is just collecting the corners of plots with a GPS, and the green one is having a field worker trace the specific outline of those fields.
So a key point I want to make before moving on to the next section is that, however you generate your data, whatever you find is best, it's critical that you have a unique identifier that can be associated with each project feature.
This might come from project records, be an arbitrarily generated number, or follow some methodology specific to your data and evaluation.
For example, if you have a precise project site which you are testing with variable buffer sizes and then splitting into grid cells,
your unique ID can combine a project site ID from your project records, the size of the buffer, and then the grid cell number.
The important part is that you're consistent, and that you use this identifier for all the different sources of data associated with the project, which allows you to link it all back together.
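As one possible way to build that kind of combined identifier, the sketch below joins a site ID, a buffer size, and a grid cell number into a single string and can split it back apart. The naming pattern and the function names are hypothetical; any scheme works as long as it is applied consistently.

```python
def make_unit_id(site_id, buffer_m, cell_number):
    """Combine site ID, buffer size (meters), and grid cell number."""
    return f"{site_id}_b{buffer_m:05d}_c{cell_number:04d}"


def parse_unit_id(unit_id):
    """Recover the three parts from an ID built by make_unit_id."""
    site_id, buffer_part, cell_part = unit_id.rsplit("_", 2)
    return site_id, int(buffer_part[1:]), int(cell_part[1:])


# Hypothetical site from project records, a 1 km buffer, grid cell 42.
unit_id = make_unit_id("canal_017", 1_000, 42)
print(unit_id)
print(parse_unit_id(unit_id))
```

Zero-padding the numeric parts keeps the IDs the same length and makes them sort sensibly, and splitting from the right means underscores inside the site ID do not cause problems.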
So on that note about joining other project data to your geospatial data, we'll briefly discuss project metadata before I take some of your questions.
Metadata in general is a really broad topic that varies depending on your project, and it can cover any additional information tied to your project sites that you'd want to include alongside your geospatial project data.
A few examples are project start and end dates, implementing agencies, project costs, or other financial information.
The key things I want to note are all really just aimed at making your lives easier when collecting this data. Sometimes all the project information or metadata is already available in a clean tabular format that you can just merge with your geospatial data on some unique ID.
But sometimes you might need to extract that information from project records. For example, the same
text documentation we used to find location information might have
interspersed information about the start date.
In all cases I suggest first determining what project information is already available and what you need to extract,
to avoid having to go back and search through the documentation multiple times.
Then determine an approach to extract and standardize all of that information.
And when documentation across the project varies, perhaps due to different implementing partners, standardizing that data is going to be critical.
Finally, and perhaps most importantly, as I mentioned before, always use that unique identifier that's going to tie your metadata to your geospatial data.
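To show what tying metadata to geospatial data by a unique identifier can look like, here is a minimal sketch using plain Python dictionaries. The `fid` values, site names, and years are invented, and in practice this would more likely be a table join in a GIS tool or a data analysis library.

```python
# Hypothetical geospatial features, reduced to their IDs and geometry types.
features = [
    {"fid": 1, "geometry_type": "LineString"},
    {"fid": 2, "geometry_type": "LineString"},
    {"fid": 3, "geometry_type": "LineString"},
]

# Hypothetical metadata table extracted from project records.
metadata = [
    {"fid": 1, "site_name": "North canal", "year_completed": 2015},
    {"fid": 2, "site_name": "East canal", "year_completed": 2017},
]

# Index the metadata by ID, checking that no ID appears twice.
metadata_by_id = {}
for row in metadata:
    if row["fid"] in metadata_by_id:
        raise ValueError(f"Duplicate metadata ID: {row['fid']}")
    metadata_by_id[row["fid"]] = row

# Join where an ID matches, and keep track of features with no metadata.
joined = [
    {**feature, **metadata_by_id[feature["fid"]]}
    for feature in features
    if feature["fid"] in metadata_by_id
]
unmatched = [f["fid"] for f in features if f["fid"] not in metadata_by_id]

print(joined)
print("Features missing metadata:", unmatched)
```

Listing the unmatched IDs is a cheap check that tells you which sites still need metadata extracted from the documentation.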
I have here a table which represents a portion of the metadata from a recent evaluation. The FID field represents the unique identifier that links the metadata table to the geospatial data.
Additional fields we've included here are the type of project, the site name, the year it was completed, and the size of the project site.
For this subset of the project sites, the metadata was already compiled in a spreadsheet. But for other sites we had to actually extract that project information from PDF copies of maps, like I mentioned before.
So here you can see an example image from a PDF map of a project site that contains some additional information about the canal project.
So, in addition to georeferencing and tracing the canal route on these maps, our team also had to extract and record the metadata from this PDF into a spreadsheet.
So we have the canal project ID, which was our unique ID for both the geospatial data and the metadata, and which we were able to use to join that spreadsheet and the geospatial data.