Identifying original file author/file fakery

Ducked · June 28, 2019, 2:53am

Is there any way to do this without access to the full resources of the Langley Cryptography Unit?

I’d like to be able to nail students strongly suspected of faking and/or copying the results of an environmental remote sensing survey, exported from the application i-Tree Canopy as a .csv file.

I missed a trick in not requiring the submission of the native application file, and not putting some kind of “secret signature” on my example .csv file.

Probably too late to fix it this time around, but I’d be interested in countermeasures for any future projects.

I thought the author shown under security>details in Windows Explorer might help, but it doesn’t seem to be consistently recorded/displayed.

finley · June 28, 2019, 3:06am

I don’t quite understand the scenario, but there’s typically no way to determine if someone has plagiarized part of a file, i.e., cut-n-pasted. The usual method of detection is to simply Google the suspect phrase and see if it turns up in an online publication.

Ducked · June 28, 2019, 5:21am

Google won’t do the job in this case.

It’s a .csv data file, containing land cover classes and associated GPS co-ordinates. Even if a Google search could find such stuff, it won’t be on the Internyet anyway.

finley · June 28, 2019, 5:34am

I realise that - I was just pointing out there’s only one way to find out if someone’s plagiarized something and that’s to find the source document. Only in very rare instances can you “watermark” something such that it survives a copying process.
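One of those rare cases might be numeric data, where a mark can hide in digits nobody looks at. A minimal sketch in Python, assuming each student gets their own copy of the points file and that the extra decimal places survive whatever the application does on import and export (not tested here). `SECRET`, `mark_digits`, `watermark` and `extract_mark` are made-up names, not part of i-Tree Canopy:

```python
import hashlib
import hmac

# Hypothetical secret known only to the teacher.
SECRET = b"change-me"

def mark_digits(student_id, point_id, n_digits=2):
    """Derive a repeatable n-digit mark for one student and one point."""
    msg = f"{student_id}:{point_id}".encode("utf-8")
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (10 ** n_digits)

def watermark(value, student_id, point_id, keep=5, n_digits=2):
    """Round a coordinate to `keep` decimals, then append the mark digits."""
    base = f"{value:.{keep}f}"
    mark = mark_digits(student_id, point_id, n_digits)
    return f"{base}{mark:0{n_digits}d}"

def extract_mark(coord_text, keep=5, n_digits=2):
    """Read the mark digits back out of a watermarked coordinate string."""
    decimals = coord_text.split(".")[1]
    return int(decimals[keep:keep + n_digits])
```

If the trailing digits in a submitted file match `mark_digits` for some other student across many points, that suggests where the file came from. One hundred-thousandth of a degree is roughly a metre on the ground, so the mark shifts a point by a metre or two at most. Anyone who retypes or rounds the coordinates wipes the mark out.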

I’m not quite sure what it is you suspect them of or how the situation arose. You gave them this text file, and you suspect they’ve copied chunks? What were they supposed to do instead? Why do they have this data set in the first place? A bit more context might help, but IMO the only solution here is to not lead students into temptation, since plagiarism (or making stuff up) is not considered particularly sinful here.

Ducked · June 28, 2019, 5:50am

Disagree.

The solution is to lead them into temptation and then make damn sure you nail them for succumbing to it, and the last bit I have failed to do.

I’ll try and do better next time.

the_bear · June 28, 2019, 7:16am

They’re supposed to create their own data set through their own surveys. Instead they’ve used the teacher’s data set in their report production. But he can’t prove it since he didn’t ask for the native data sets to be included in the returned assignment.

RickRoll · June 28, 2019, 7:24am

Can’t he ask for it afterwards?

finley · June 28, 2019, 7:26am

That’s what I read. But in that case, surely it’s a simple case of spotting the same figures in the student’s document? I don’t see how they could have submitted any sort of sensible report without including their own raw data. Standard procedure would be, if you can’t actually include the data itself in a scientific paper (because there’s too much of it) you make it available on request, or publish it online for download. They should get used to following this sort of procedure.

the_bear · June 28, 2019, 7:58am

i-Tree Canopy randomly generates sample points so every person would have a different final report even though they input the same raw data.

Ducked · June 28, 2019, 8:15am

Nope, I fixed that. Gave them a standardised set of points. Thought I was being pretty clever.

Then I screwed up by not requiring them to submit the native application output (which I think would be hard to fake). DUH!

Too late now.

At least one of the “usual suspects” has another person in the class named as author in Windows Explorer>Properties>Details.

Maybe that’s enough to nail them. Not sure.

Ducked · June 28, 2019, 8:31am

Don’t see how. Pretty trivial to edit a spreadsheet.

One possibility would be to compare their Group Report data with their individual data.

Their group report is supposed to be a comparison of my 500 point classification at T=0 to their 500 point classification at a later time.

Their 500 point classification is a merge, in Excel, of their individual 100 point classifications, so they should be the same.

However, where a group contains lazy fuckers, the other group members might fill in for them by re-surveying, to avoid lowering their group score.

I checked a prime suspect and got 47 mismatches (out of 100) which does tend to support my prejudice, but isn’t definitive, court-of-law stylee.
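For what it’s worth, that check is easy to script. A rough sketch in Python, assuming both files have a point ID column and a cover class column. The column names `id` and `cover_class` are placeholders, not the actual i-Tree Canopy export headers:

```python
import csv
import io

def load_classes(csv_text, id_col="id", class_col="cover_class"):
    """Map each point ID to its cover class (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col].strip(): row[class_col].strip() for row in reader}

def count_mismatches(individual, group):
    """Compare classifications for the point IDs present in both files."""
    shared = sorted(set(individual) & set(group))
    mismatched = [pid for pid in shared if individual[pid] != group[pid]]
    return len(mismatched), len(shared), mismatched

individual_csv = "id,cover_class\n1,Tree\n2,Grass\n3,Water\n"
group_csv = "id,cover_class\n1,Tree\n2,Bare\n3,Water\n4,Tree\n"

n_bad, n_shared, bad_ids = count_mismatches(
    load_classes(individual_csv), load_classes(group_csv)
)
print(f"{n_bad} mismatches out of {n_shared} shared points: {bad_ids}")
```

Only points present in both files are compared, so a student who drops awkward points from one file would need checking separately.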

finley · June 28, 2019, 9:03am

I’m still not quite sure I understand the nature of your experiment, but if you observe or suspect a correlation between your data set and theirs, then can you test for statistical significance - i.e., how likely is it that that correlation arose by chance? t-test?

Mick · June 28, 2019, 9:13am

What happens when you right-click the file, select “properties” and then click the “details” tab? Usually some metadata about the file, computer or program shows up.

hansioux · June 28, 2019, 9:18am

If it’s an xls or xlsx file then yeah, you might have some metadata, but a plain csv file is just a text file, with commas separating each data point.

There is absolutely no way of telling if someone has copied and pasted the answer into said file.
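About the most a script can do with a bare csv is compare contents. A minimal sketch in Python that fingerprints a file after stripping stray whitespace and blank lines, so differences in line endings don’t hide an otherwise identical file. `content_fingerprint` is a made-up name:

```python
import csv
import hashlib
import io

def content_fingerprint(csv_text):
    """Hash the cell contents only, ignoring padding and blank lines."""
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(csv_text))
        if any(cell.strip() for cell in row)
    ]
    canonical = "\n".join(",".join(row) for row in rows)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two submissions with the same fingerprint have the same data. Changing a single value changes the fingerprint, so this only catches straight copies, and it says nothing about who copied from whom.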

Shaun008 · June 28, 2019, 9:43am

They can just claim to have used another person’s computer.

Ducked · June 28, 2019, 2:07pm

Not really an experiment, it’s an environmental change survey. Land cover classification at 5 different times (me and 4 groups) in an area that has had a big disturbance.

I suspect some of them of falsifying their data, because the results are in some cases unlikely, and because there is a correlation between the unlikely results and the producers being lazy and/or stupid, and/or having a history of cheating in tests.

If the results are faked they need not necessarily be based on my file, though that is a convenient starting point.

I doubt statistics would give forensic grade evidence, and since the data is categorical I’d think you’d need to use…er…non-parametric (?) tests like Chi-square. If I’d had time to introduce stats for this project that’s probably what I would have done. Maybe next time.
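A rough sketch of what that might look like, in plain Python: Pearson’s chi-square statistic for two classifications of the same points, cross-tabulated. The function name and the example labels are made up for illustration. Turning the statistic into a p-value needs a chi-square table or a stats library.

```python
from collections import Counter

def chi_square(labels_a, labels_b):
    """Pearson chi-square statistic and degrees of freedom for two
    categorical labelings of the same points (paired by position)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both lists must cover the same points")
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    row_totals = Counter(labels_a)
    col_totals = Counter(labels_b)
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            # Count expected in this cell if the labelings were independent.
            expected = row_total * col_total / n
            observed = pair_counts.get((row, col), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

mine = ["Tree", "Tree", "Grass", "Grass"]
theirs = ["Tree", "Tree", "Grass", "Grass"]
print(chi_square(mine, theirs))  # (4.0, 1)
```

A large statistic only says the two sets of labels are associated, which honest surveys of the same points probably would be anyway, so it isn’t proof of copying. With small expected counts per cell the test is unreliable in any case.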

I think I’ll probably have to take them at face value, a pity since I know a lot of the students worked quite hard on this and they’ll get graded on the same basis as the lazy fuckers.

Ducked · June 28, 2019, 2:15pm

I did that a couple more times, once with a suspect (84% mismatch) and once with a non-suspect (93% mismatch) so either faking is general, or there’s something wrong with that (or the survey) procedure.

Anyway, grades have to go in before midnight, so I’ll just have to accept it for now.

Ducked · June 29, 2019, 2:35am

[quote=“finley, post:8, topic:181187, full:true”] Standard procedure would be, if you can’t actually include the data itself in a scientific paper (because there’s too much of it) you make it available on request, or publish it online for download.
[/quote]
I rather doubt that is actually standard procedure, though of course it should be. Certainly when I worked as a research assistant when I first got here, that was never done for the papers I helped get published, and in some cases I’d suspect the raw data didn’t exist.

finley · June 29, 2019, 2:59am

If you don’t publish your data, your research is meaningless. The whole point of publishing is so that people can attempt to reproduce your results and/or refute your hypothesis. Proper scientists always make their data freely available somehow (although not necessarily to all and sundry) so that - for example - other people can perform alternative analyses on it.

But yeah, there are plenty of pretend-scientists (like nutritionists) who can get away with just publishing their conclusions and expecting a round of applause. They’re publishing for an audience of other non-scientists, so their little happy-clappy charade works out just fine for all concerned. Mostly, though, scientists who refuse to release their source data have something to hide. Scientific fraud is more widespread than scientists would like to admit.

Ducked · January 9, 2021, 2:37am

Doing this again, though for a different survey area. Don’t have any clever tricks to prevent fakery other than requiring submission of the native format project file from i-Tree Canopy. I’ll do some spot checking and try and build credibility and consistency scores into the grading.

This avoids outright accusation of fakery, which should be an automatic fail so would need to be definitively proved, and would reinforce my existing troublemaker image with the authorities…

Last time I gave them soft copies of an example report of a survey I did of siltation and vegetation in the Chernobyl Cooling Pond, and in some cases got it right back as their report, with the title changed, so this time they are only getting paper example copies.