More PACE trial data released

Barry · May 12, 2019

Lucibee said:
Without confirmation, it is not safe to trust that the records are in the same order for every file. Without knowing which groups the data were in, they are pretty much useless, apart from providing a summary of the entire cohort.

Only had a short while to look at this, but I think it is possible to reliably align the original FOI dataset with the new ... unless I'm being silly and missing something.

The new dataset is a superset of the previously released dataset, so it includes the same columns of information, but with the rows reordered (each row being a particular participant, of course). But something like wtms0 and wtms.52 are pretty unique values for each participant, so they will likely act as surrogate participant IDs. I tried the following sort on both datasets, and at a very quick glance they then looked to align OK. But I've not checked this rigorously. Even if there were a few duplicated rows, the likelihood of duplicates should fall if more sort levels were added.
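A minimal sketch of how the uniqueness of such a surrogate key could be checked before sorting on it. It assumes the columns wtms0 and wtms.52 are present in the file being checked; the helper name `duplicate_keys` and the rows are made up for illustration and are not trial data.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur in more than one row, with their counts."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows for illustration only; these are not real PACE values.
rows = [
    {"wtms0": "310", "wtms.52": "355"},
    {"wtms0": "310", "wtms.52": "402"},
    {"wtms0": "275", "wtms.52": "355"},
    {"wtms0": "310", "wtms.52": "355"},
]

# One sort level leaves many clashes; adding a second level reduces them.
dupes_one = duplicate_keys(rows, ["wtms0"])
dupes_two = duplicate_keys(rows, ["wtms0", "wtms.52"])
print(dupes_one)
print(dupes_two)
```

Any key that still shows up in the result cannot tell those participants apart, so more sort levels would be needed until the result is empty.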

Or have I misunderstood what the problem is?

Lucibee · May 12, 2019

Barry said:
Or have I misunderstood what the problem is?

Yes. You have misunderstood what the problem is. This is why this is so dangerous!

The new dataset was NOT a superset of the previously released dataset.
The dataset posted on here was generated based on the assumption that the data were in the same order. That assumption was INCORRECT. So that dataset is flawed and is contaminated. *klaxon*

QMUL only provided the requested data, all of which are new data (well, at least to us). There was no previous data on which to do any successful merge.

Barry · May 12, 2019

sTeamTraen said:
There were no common variables.

All I was able to do was merge the columns, on the assumption that the participant order was the same in each case.

Ah ... I see @Lucibee's point now. That's a bummer. The fundamental issue is really that these data releases do not include a unique participant ID (anonymous of course), such as a relevant database key or suchlike; every data release needs to include the same ID parameter.

ETA: I also now appreciate @Lucibee's point that this merged dataset absolutely must not be distributed, because the data is likely incoherent. All manner of screwed up analyses could get proliferated.

Barry · May 12, 2019

Is it possible to have a copy of the unmerged dataset from this recent FOI data release please @JohnTheJack, as supplied to you? I know I could probably infer it by removing the data columns pertaining to original 2016 FOI release, but would prefer not to.

sTeamTraen · May 12, 2019

Barry said:
Is it possible to have a copy of the unmerged dataset from this recent FOI data release please @JohnTheJack, as supplied to you? I know I could probably infer it by removing the data columns pertaining to original 2016 FOI release, but would prefer not to.

The latest release came in 5 files, with names that began with the digits 2 through 6 (perhaps because 1 was the first file released? I wasn't around then).

The variables in the new files are as follows:
File "2 pace_EQ5D": eq_index.0, eq_index.24, eq_index.52
File "3 pace_HAD": HAANT.0, HAANT.24, HAANT.52, HADET.0, HADET.24, HADET.52
File "4 pace_WSAS": WSAT.0, WSAT.24, WSAT.52
File "5 pace_BORG": STBOR.0, STBOR.24, STBOR.52
File "6 pace_PHQ15": PSTOT.0, PSTOT.24, PSTOT.52

You can reconstruct those files, exactly as I received them from John, by saving just the appropriate columns from the "omnibus" file that I assembled. That is, I did not do any sorting on these files, because I knew that I had to assume that they were sorted in the same order (of participant number) as the original file.

I can also mail the files to you, if John isn't immediately available: nicholasjlbrown@gmail.com.

sTeamTraen · May 12, 2019

Lucibee said:
As MS would say, "read the paper!" - in particular, the footnote to table 3 explains that the comparisons shown are for the final adjusted models.

OK, done that.

I see now.

While I was continuing to reproduce Table 3, adding the number of people who improved, I decided to see whether the change of criteria (from Likert <= 18 to binary <= 3) benefited CBT and GET. It seems that the opposite was the case: The percentage of people either recovering or improving (using a change of 2 Likert points or 1 binary point, the latter of which I assumed would have been the criterion had they kept the binary version of the fatigue scale) was higher for APT and SMC.

TX    Improv-Likert    Improv-binary    %incr
APT              99               65     52.3
CBT             113               87     29.9
GET             123               87     41.4
SMC              98               62     58.1

TX    Recov-Likert    Recov-binary    %incr
APT             34              17    100.0
CBT             60              31     93.5
GET             51              30     70.0
SMC             32              11    190.9
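The %incr column seems to be the relative increase from the binary count to the Likert count, that is, 100 × (Likert − binary) / binary. A short sketch that recomputes it from the counts in the two tables above (the function and dictionary names are just for illustration):

```python
def pct_increase(likert, binary):
    """Relative increase from the binary count to the Likert count, in percent."""
    return round(100.0 * (likert - binary) / binary, 1)


# (Likert count, binary count) per treatment arm, copied from the tables above.
improved = {"APT": (99, 65), "CBT": (113, 87), "GET": (123, 87), "SMC": (98, 62)}
recovered = {"APT": (34, 17), "CBT": (60, 31), "GET": (51, 30), "SMC": (32, 11)}

improved_pct = {arm: pct_increase(*counts) for arm, counts in improved.items()}
recovered_pct = {arm: pct_increase(*counts) for arm, counts in recovered.items()}

for arm in improved:
    print(arm, improved_pct[arm], recovered_pct[arm])
```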

Lucibee said:
Mean difference is NOT the same as difference in means.

That raises the question, what do they mean (!) by "mean difference"? I don't see how that has any sense unless there is some sort of pairing. You could have the mean difference across a time period within a treatment, but across treatments you don't have any pairs of data to calculate a "difference" that you could then take the mean of.

Lucibee said:
We already know that the original FOIA PACE data is OK. We are now entering very dangerous territory by putting a non-legit dataset out there.
If anyone has downloaded it and wants to look at it, I would suggest that they remove the former PACE data from it and separate out the secondary variables into separate sheets.
If they can't do this, I would delete it entirely. Otherwise there is a danger of coming across it at a later date, thinking it is OK, and sharing it.
We need to be really really careful about data contamination.

I agree. I can make a CSV file with just the original data available, with all of the other annotations that I added when building the omnibus file, and send it to John for him to replace the original. We could maybe leave the original up, with a big fat warning.

Alvin · May 12, 2019

JohnTheJack said:
I also asked for the Client Service Receipt Inventory scores.

Their response was:

Anyone have any ideas as to whether it's worth asking for anything?

I say it's worth asking for.
Who knows what the data will tell us, and if they are trying to hide it, it would be for good reason.

Lucibee · May 13, 2019

sTeamTraen said:
I agree. I can make a CSV file with just the original data available, with all of the other annotations that I added when building the omnibus file, and send it to John for him to replace the original. We could maybe leave the original up, with a big fat warning.

You may agree, but I don't think you understand, Nick. The original (2016) data release file is fine and is available elsewhere. The version you produced by appending data to it is not. There is absolutely no point in leaving that version up, even with a big fat warning (I think @JohnTheJack has taken it down now anyway).

This stuff gives me nightmares. I think I need to step back from this for a bit. Sorry.

JohnTheJack · May 13, 2019

Lucibee said:
You may agree, but I don't think you understand, Nick. The original (2016) data release file is fine and is available elsewhere. The version you produced by appending data to it is not. There is absolutely no point in leaving that version up, even with a big fat warning (I think @JohnTheJack has taken it down now anyway).

This stuff gives me nightmares. I think I need to step back from this for a bit. Sorry.

The original has now been taken down.

Nick has created a new file with all the data together, but the new data is clearly marked in red as not aligned and not to be used as if it were.

There is also a new 'readme' file that repeats the warning.

The latest files have also been uploaded separately.

The folder link is here.

I am going to contact them and say I will apply to reinstate the appeal unless I have the data properly aligned.

Thank you everyone, especially Lucy and Nick, for all your work and help.

JohnTheJack · May 13, 2019

Barry said:
Is it possible to have a copy of the unmerged dataset from this recent FOI data release please @JohnTheJack, as supplied to you? I know I could probably infer it by removing the data columns pertaining to original 2016 FOI release, but would prefer not to.

They are now in the Dropbox folder.
https://www.dropbox.com/sh/f3nfolkh1hlw9kg/AACb78M_jA3Q_NoBbnbQsWvXa?dl=0

chrisb · May 13, 2019

JohnTheJack said:
I am going to contact them and say I will apply to reinstate the appeal unless I have the data properly aligned.

Is it a necessary conclusion that the data have been altered in a way which would have been known to those releasing them for QMUL? Is it not possible that they merely released the file in the form which it was on their database assuming and trusting that it was in the form which had been used for the calculation, and innocently failing to check for amendments? There might have been some scrambling at an earlier date. But I know nothing of the processes which would have been undertaken, so feel free to ignore me.

JohnTheJack · May 13, 2019

chrisb said:
Is it a necessary conclusion that the data have been altered in a way which would have been known to those releasing them for QMUL? Is it not possible that they merely released the file in the form which it was on their database assuming and trusting that it was in the form which had been used for the calculation, and innocently failing to check for amendments? There might have been some scrambling at an earlier date. But I know nothing of the processes which would have been undertaken, so feel free to ignore me.

They should have been able to make the data correspond.

Adrian · May 13, 2019

JohnTheJack said:
I am going to contact them and say I will apply to reinstate the appeal unless I have the data properly aligned.

Thank you everyone, especially Lucy and Nick, for all your work and help.

It would be good to push for the step test, which I think you requested, and the actual EQ-5D data rather than the UK summary scores.

Barry · May 13, 2019

JohnTheJack said:
I am going to contact them and say I will apply to reinstate the appeal unless I have the data properly aligned.

Would there be any chance of getting them to also include a couple of variables (or however many are needed) from the original 2016 release to act as a surrogate ID to sort on? We would then not have to trust their sorting of the table rows, but could ourselves sort both old and new against the same variables. So long as those variables combine to a unique identifier per participant, we would be OK. See my earlier post #81, though I realise it would need more than the two columns I suggested (I was in a rush) to be unique.
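A rough sketch of what that could look like, assuming a future release did repeat wtms0 and wtms.52 alongside a new variable such as WSAT.0. The helper `merge_on_key`, the column `old_var`, and all values are hypothetical; the point is only that rows are matched on the shared key rather than on row order, and that a non-unique or mismatched key is an error rather than a silent misalignment.

```python
def merge_on_key(old_rows, new_rows, key_columns):
    """Join two releases on a composite key, refusing to guess if the key is ambiguous."""

    def index(rows):
        table = {}
        for row in rows:
            key = tuple(row[c] for c in key_columns)
            if key in table:
                raise ValueError(f"key {key} is not unique")
            table[key] = row
        return table

    old_index, new_index = index(old_rows), index(new_rows)
    if old_index.keys() != new_index.keys():
        raise ValueError("the two releases do not contain the same keys")
    # Row order of the result follows the old release; the new rows may be in any order.
    return [{**old_index[key], **new_index[key]} for key in old_index]


# Made-up values for illustration only.
old_release = [
    {"wtms0": "310", "wtms.52": "355", "old_var": "A"},
    {"wtms0": "275", "wtms.52": "402", "old_var": "B"},
]
new_release = [  # deliberately in a different row order
    {"wtms0": "275", "wtms.52": "402", "WSAT.0": "21"},
    {"wtms0": "310", "wtms.52": "355", "WSAT.0": "30"},
]

merged = merge_on_key(old_release, new_release, ["wtms0", "wtms.52"])
print(merged)
```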

Esther12 · May 13, 2019

JohnTheJack said:
I am going to contact them and say I will apply to reinstate the appeal unless I have the data properly aligned.

Depending on exactly what you requested in the first place, might it be better to make a new request? (This is a question from a position of near complete ignorance, I've forgotten all the details of this request). I'm not sure how much leeway there is in how a request can be interpreted, if there should be room for requests to be clarified during the appeals process, etc. I guess there's also a danger that a repeated similar request could be viewed as 'vexatious'?

I just thought I'd mention those things in case it's helpful to consider them.

JohnTheJack said:
Thank you everyone, especially Lucy and Nick, for all your work and help.

Yes, thanks to both.

Barry · May 13, 2019

chrisb said:
Is it a necessary conclusion that the data have been altered in a way which would have been known to those releasing them for QMUL? Is it not possible that they merely released the file in the form which it was on their database assuming and trusting that it was in the form which had been used for the calculation, and innocently failing to check for amendments? There might have been some scrambling at an earlier date. But I know nothing of the processes which would have been undertaken, so feel free to ignore me.

Yes. There may well be a disconnect, in terms of sort order, between the two data releases, and there may very well also be a disconnect between both releases and QMUL's data records. Sort order is not really top of the priority list when compiling data like this, and we only have the problem now because there is nothing to marry the two sets of data with.

Ideally we would have one of, or both:

Unique participant IDs, common to all data releases. And/or ...
Each released dataset being a superset of previously released datasets.

In fact I personally would have been pleasantly surprised if the sort order had been the same between the two, given we only ever get subsets. The original data is held in a database, and this info is queried out of it, so in a way it just depends on how the queries are done I imagine, which are unlikely to be that close a match to queries for previous data releases.

Which is why I think it would be very hazardous to rely on them giving us the same sort order, even if they genuinely attempt to do so.

@Lucibee, @Adrian: Are there double checks that can be indirectly done to give confidence the sort orders are the same, if we get to that point?

Barry · May 13, 2019

Another very quick thought. We may have to be prepared for the possibility that QMUL may not actually be able to guarantee a particular sort order! Someone who better understands databases will know this better, but I'm guessing that when data is queried out of a db there will be a default sort applied, which likely depends on the parameters chosen and their values. Different parameters / values might well lead to a different sort order, and the person doing the querying not knowing (or caring) how things got sorted.

Snow Leopard · May 13, 2019

It is quite deliberate that they've made it so the data sets cannot be combined. I knew they had some sort of trick up their sleeves, I guess this is it.

sTeamTraen · May 13, 2019

JohnTheJack said:
They should have been able to make the data correspond.

I find it absolutely inconceivable that there is not, somewhere in the vaults at QMUL, a single file with all of the variables in it. That's how you analyze data from any study of any kind, unless there's some absolutely huge number of participants or variables.

Barry said:
Another very quick thought. We may have to be prepared for the possibility that QMUL may not actually be able to guarantee a particular sort order! Someone who better understands databases will know this better, but I'm guessing that when data is queried out of a db there will be a default sort applied, which likely depends on the parameters chosen and their values. Different parameters / values might well lead to a different sort order, and the person doing the querying not knowing (or caring) how things got sorted.

The data shouldn't be stored in a "database", in the sense of a traditional computer database management system where you have multiple tables and need keys to tie the various tables together. It will have been analyzed from a single file in the proprietary format of whichever software they used. Unusually, they report using Stata, SAS, and SPSS, which probably means they used a CSV file (a very plain, neutral file format that any statistical software package can read, including Excel) to transfer the data between the different pieces of software.

If that's the case (and if it isn't, they have a very weird setup), then the process of sending out variables selectively ought to be easy. Start with the master file sorted by participant ID, make a copy, and in that copy, delete all the variables except the ones you want to share. If you reassemble multiple chunks that were made that way, you will get back the original records.
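As a rough illustration of that process (a sketch under assumptions, not a description of what QMUL actually does): the master file below is made up, `pid` is a hypothetical participant ID column, and the two helper functions are invented for the example. Chunks cut from the same sorted master line up row by row, and a chunk that was shuffled does not.

```python
def take_columns(rows, columns):
    """Copy the rows, keeping only the variables to be shared."""
    return [{c: row[c] for c in columns} for row in rows]


def reassemble(*chunks):
    """Glue chunks back together row by row, relying purely on row order."""
    if len({len(chunk) for chunk in chunks}) != 1:
        raise ValueError("chunks have different numbers of rows")
    merged = []
    for parts in zip(*chunks):
        row = {}
        for part in parts:
            row.update(part)
        merged.append(row)
    return merged


# Made-up master file; "pid" is a hypothetical participant ID column.
master = [
    {"pid": 3, "WSAT.0": "30", "eq_index.0": "0.52"},
    {"pid": 1, "WSAT.0": "21", "eq_index.0": "0.69"},
    {"pid": 2, "WSAT.0": "27", "eq_index.0": "0.41"},
]
master.sort(key=lambda row: row["pid"])

chunk_a = take_columns(master, ["WSAT.0"])
chunk_b = take_columns(master, ["eq_index.0"])

rebuilt = reassemble(chunk_a, chunk_b)
expected = take_columns(master, ["WSAT.0", "eq_index.0"])
print(rebuilt == expected)

# If one chunk is exported in a different order, the records no longer match.
scrambled = reassemble(chunk_a, list(reversed(chunk_b)))
print(scrambled == expected)
```

Without an ID column in the chunks themselves, nothing in the scrambled result reveals that it is wrong, which is why a shared ID in every release matters.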

Snow Leopard said:
It is quite deliberate that they've made it so the data sets cannot be combined. I knew they had some sort of trick up their sleeves, I guess this is it.

The fact that the sort order in the new files is not the same as the first file is either down to incompetence or malice. I always prefer the first of these explanations, but the level of incompetence (probably by multiple people) required in order for the newly-released variables not to be sorted correctly makes me wonder here.

Esther12 said:
I guess there's also a danger that a repeated similar request could be viewed as 'vexatious'?

This reminds me of a situation in another part of my life, which is a game where I have to gently ask people not to bend the rules. It's complicated, but let's say the maximum X they are allowed is 100, and they send in 115, so I say "No, please make it less than 100". They change it to 108 and I say "Please try again". They change it to 103 and I say "Please get it below 100", and then they go onto a forum and it's "The moderator is being such a nitpicker".

If this kind of argument gets wheeled out, it will be important to make sure the judge (etc) understands that it's not nitpicking or vexatious to point out that if the request was only 98% complied with, and the 2% missing (i.e., the correct sort) means that the other 98% is useless, then the request as a whole was not complied with.

Trish · May 13, 2019

I haven't followed all this discussion, and I am hugely grateful to @JohnTheJack and Alem Matthees for their fantastic efforts in getting some of the data.
Would it make sense to simply ask for all the data apart from that which could be seen as individual identifiers, such as age and location? The entire data set must surely be stored in a single spreadsheet or similar.
