Assessing Data Quality
Participants: Joleen, Ter Yang, Amy, Chei Oh (sp?) with the Malaysian Garden Birds program, Tashi from Nepal, Marcus, Andrea
Challenges:
Ter Yang - Horseshoe crabs with NSS. Usually we're talking about scientist-generated protocols with volunteers contributing, but in his project the situation was different: it didn't start with a "properly" defined protocol, it grew more organically. It started with birdwatchers who reported horseshoe crabs stuck in nets -- passionate people interested in saving the crabs. They were releasing hundreds each week, so the project moved to counting the number of crabs being freed. Then they noticed some were larger and others smaller, so they started measuring size. So there's a corpus of data, but they didn't know how to analyze it until a scientist came in to help -- the project's protocol was in place before any formal researchers were.
Chei Oh (sp?) - the Garden Birds program started with collecting data but didn't know what to do with it. Only in the last 4 years did they get an academic involved who could help with analysis. They found they can do some things with the data they already had, but not as much as they would like.
Ter Yang - NGOs often start these projects without a clear question to answer. The data aren't useless, but they don't lend themselves to rigorous analysis. Right now more scientifically trained people are among the volunteers, so they are working on a protocol and specific questions to address, and data collection is getting more systematic. That doesn't mean prior data are useless or should be discarded; the challenge is figuring out how to use them best.
Amy - From a different angle, the challenge is less about data collection and more about data analysis. Bringing data sources together is painful and difficult. In a way it's politically awkward too -- you don't want to ask groups to work together just to make data integration easier, but it really would make a big difference. It's also challenging with groups who are protective of their data and their constituencies, which makes it harder for them to use the data and produce what they want from it.
Interoperability issues are often political; water quality often has this problem. Hard to know ...
Tashi - Jim Sanderson, a university professor, uses machine learning on location data for snow leopards. In Manang (sp?) there are 13 sites, each 5x5 km, with camera traps. In Nepal they capture the data and send it for analysis; the ML uses it to predict other areas to look. It's still early, so they don't know yet whether there will be quality problems. Eventually it's supposed to cover the entire snow leopard range. ML could potentially backfill data if they have to change the protocol. But there's no ground truth -- no one knows the actual number.
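For context, a minimal sketch of that general idea -- train a model on surveyed cells, then rank unsurveyed cells by predicted detection probability. The covariates, model choice, and all numbers below are illustrative assumptions, not the actual pipeline described here:

```python
# Hypothetical sketch only: rank candidate grid cells by predicted
# probability of snow leopard detection, trained on surveyed cells.
# Covariates, model, and data are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 13 surveyed 5x5 km cells: columns = [elevation_km, prey_index, ruggedness]
X_surveyed = rng.normal(size=(13, 3))
y_detected = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1])  # camera-trap hit?

model = LogisticRegression().fit(X_surveyed, y_detected)

# Score unsurveyed candidate cells and suggest where to place cameras next.
X_candidates = rng.normal(size=(50, 3))
p_detect = model.predict_proba(X_candidates)[:, 1]
top5 = np.argsort(p_detect)[::-1][:5]
print("Top candidate cells:", top5, p_detect[top5].round(2))
```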
Ter Yang - what's the accepted accuracy for ML? It depends on the purpose, the field, etc.
Marcus - It depends on the goals; so far ML is usually compared against other methods, aiming to improve incrementally.
Ter Yang - if analysis is really driven by ML, will science become even more out of touch, because it's too far removed and too abstract? He has been told that it's best for scientific studies to use simple statistical analysis, which reflects a robust data collection design; you only resort to bootstrapping and complex analyses when the protocol is flawed or the data are imperfect enough that you have to work around the problems.
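To make the bootstrapping reference concrete, a minimal sketch with made-up counts (the data and the 95% interval choice are assumptions for illustration):

```python
# Minimal bootstrap sketch: a 95% confidence interval for a mean count,
# the kind of resampling analysis alluded to above. Counts are made up.
import numpy as np

rng = np.random.default_rng(42)
counts = np.array([12, 7, 15, 9, 30, 11, 8, 14, 10, 6])  # e.g., crabs freed per session

boot_means = np.array([
    rng.choice(counts, size=counts.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {counts.mean():.1f}, 95% bootstrap CI = ({lo:.1f}, {hi:.1f})")
```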
Geospatial data is already problematic due to bias, and always will be.
Joleen - false reporting can creep in, e.g., students trying to be funny. She volunteered for the International Coastal Cleanup in Singapore; the students doing the cleanup had to be there for community involvement credit, and she knew these were not good students. She saw that groups had reported the same proportions of certain items, which looked wrong, and reported it to the teacher, who wasn't interested in dealing with it. There's an issue of motivation to do it right. What about differing effort between groups -- one group is assiduous, one is lazy? Maybe the way the groups are composed would change this: assignment versus self-selection.
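One cheap screen for copied-looking tallies like those is to flag groups whose reported item proportions are near-identical. A sketch (the column names, numbers, and threshold are all hypothetical):

```python
# Sketch: flag cleanup groups whose reported item proportions are
# suspiciously similar (possible copied or fabricated tallies).
# Column names, data, and the 0.05 threshold are illustrative.
import itertools
import numpy as np
import pandas as pd

reports = pd.DataFrame({
    "group": ["A", "B", "C", "D"],
    "straws": [40, 80, 12, 41],
    "bottles": [30, 60, 25, 29],
    "bags": [30, 60, 63, 31],
})

props = reports.set_index("group")
props = props.div(props.sum(axis=1), axis=0)  # row-wise item proportions

for g1, g2 in itertools.combinations(props.index, 2):
    dist = np.abs(props.loc[g1] - props.loc[g2]).sum()
    if dist < 0.05:  # near-identical composition -> review manually
        print(f"Review {g1} vs {g2}: proportion distance {dist:.3f}")
```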
Chei Oh - bird counts with school kids: they don't know how the teacher divided the groups, and they got weird data. These were primary school kids, who don't want to do data entry. The kids were in 4 groups, did the count, and submitted results; then the teacher keyed in the data. The weirdness was apparently introduced by the teacher, who tried to combine the 4 data sets from the 4 teams, and that caused the problem. They also found experienced birders didn't want to do a 10-minute point count in their garden, which is why they relied on schoolkids. So they provide a number of rules to keep it straightforward, but still have issues that are hard to nail down.
Ter Yang - it seems unrealistic to rely on volunteers for very accurate data; isn't this why people criticize the validity of citizen science? If the data aren't high quality, the conclusions are questionable. At the same time, it's not realistic to expect extremely high quality data -- volunteers may not be able to tell similar-looking species apart.
Chei Oh - their project will contact people to ask about strange data, but you can't push too hard or you lose them. You end up comparing apples to oranges.
Scaffolded participation might help with this, like going from FeederWatch to eBird: FeederWatch is the basic entry-level project, while eBird is the advanced version. Maybe new birders would be engaged enough to be a good target for the protocol?
Chei Oh - people are only novices for 1-2 years; after that they either stop or get good enough that they don't want to do a garden point count. They're also running into issues with the number of years of participation at a single site. But maybe try a 15-minute count instead of a 30-minute one.
Marcus - maybe run an experiment: have some people mark half the list (the first 15 minutes), then add another 15 minutes of data after that, and compare to people doing only 15 minutes?
The experiment itself could be interesting to volunteers -- they might do it to see whether the time commitment for participation could be reduced, or to understand how the methods affect the results.
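One simple way to look at the resulting data, assuming each visit logs the minute each species was first detected (a hypothetical layout, not the program's actual data model):

```python
# Sketch: how much of a 30-minute count is captured in the first 15 minutes?
# Assumes each visit logs the minute of first detection per species;
# the layout and numbers are illustrative assumptions.
import numpy as np

# First-detection minutes for species on three hypothetical 30-min visits.
visits = [
    [1, 3, 4, 9, 14, 22, 28],
    [2, 2, 5, 11, 19, 25],
    [1, 6, 7, 8, 13, 16, 17, 29],
]

fractions = [np.mean(np.array(v) <= 15) for v in visits]
print("Share of species detected by minute 15 per visit:",
      [f"{f:.0%}" for f in fractions])
print(f"Mean: {np.mean(fractions):.0%}")
```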
Marcus - they do 300 m beach transects for intertidal monitoring, always pairing one person more experienced than the other. They check each point and collect data, reporting both lists of organisms and photos. When compiling the results, even with 4 training sessions and a refresher, the lists were not reliable regardless of experience level, but the photos were. If they were going to do it again, instead of 8 pages of species that's fairly comprehensive, they might pick 5 or 6 target species that are more identifiable.
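That list unreliability could be quantified as per-point agreement between the paired observers, e.g. Jaccard similarity over their species lists (a sketch; the species and lists are made up):

```python
# Sketch: Jaccard agreement between the two observers' species lists
# at each transect point. Species names and lists are illustrative.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

# (novice list, expert list) per transect point
points = [
    ({"limpet", "barnacle", "anemone"}, {"limpet", "barnacle"}),
    ({"crab", "snail"}, {"crab", "snail", "chiton"}),
    ({"anemone"}, {"sea star"}),
]

scores = [jaccard(novice, expert) for novice, expert in points]
print("Per-point agreement:", [f"{s:.2f}" for s in scores])
print(f"Mean agreement: {sum(scores) / len(scores):.2f}")
```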
Also maybe try putting the images into the Zooniverse project builder -- could we at least get identifications to family level? Similar projects like Snapshot Serengeti and Seafloor Explorer have parallel features/components to consider. Target species tend to work well: they narrow the search space for identification, and you can tell people much more specifically how to ID each species.
Chei Oh - on accuracy of the data: most of the contributors have no binoculars, which creates an issue for the accuracy of counts.
Some discussion of audio monitoring options, recording, etc.
Ter Yang - a question about what citizen science entails when we're talking about collecting data. Say you're an economist or a behavioral scientist: in a sense, people are generating data for you to analyze -- is that citizen science? Or in public health: I want to know walking patterns, so I have people do activity reporting with a Fitbit.
There's a danger of bias in the data -- maybe the fitness buffs are the only ones participating. You could possibly ask for more details to help detect that. There are also gray areas between being a subject and being a contributor.