More PACE trial data released

Discussion in 'Psychosomatic research - ME/CFS and Long Covid' started by JohnTheJack, May 7, 2019.

  1. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    Only had a short while to look at this, but I think it is possible to reliably align the original FOI dataset with the new ... unless I'm being silly and missing something.

    The new dataset is a superset of the previously released dataset, so it includes the same columns of information, but with the rows reordered. Each row is a particular participant, of course. But values such as wtms0 and wtms.52 are pretty much unique for each participant, so they will likely act as surrogate participant IDs. I tried the following sort on both datasets, and at a very quick glance they then looked to align OK. But I've not checked this rigorously. Even if there were a few duplicated rows, the likelihood of duplicates should fall if more sort levels were added.

    upload_2019-5-12_21-0-12.png

    Or have I misunderstood what the problem is?
     
    Last edited: May 12, 2019
  2. Lucibee

    Lucibee Senior Member (Voting Rights)

    Messages:
    1,498
    Location:
    Mid-Wales
    Yes. You have misunderstood what the problem is. This is why this is so dangerous!

    The new dataset was NOT a superset of the previously released dataset.
    The dataset posted on here was generated based on the assumption that the data were in the same order. That assumption was INCORRECT. So that dataset is flawed and is contaminated. *klaxon*

    QMUL only provided the requested data, all of which are new data (well, at least to us). There was no previous data on which to do any successful merge.
     
    Last edited: May 12, 2019
  3. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    Ah ... I see @Lucibee's point now. That's a bummer. The fundamental issue is really that these data releases do not include a unique participant ID (anonymous of course), such as a relevant database key or suchlike; every data release needs to include the same ID parameter.

    ETA: I also now appreciate @Lucibee's point that this merged dataset absolutely must not be distributed, because the data is likely incoherent. All manner of screwed-up analyses could proliferate.
     
  4. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    Is it possible to have a copy of the unmerged dataset from this recent FOI data release please @JohnTheJack, as supplied to you? I know I could probably infer it by removing the data columns pertaining to the original 2016 FOI release, but would prefer not to.
     
  5. sTeamTraen

    sTeamTraen Established Member (Voting Rights)

    Messages:
    46
    The latest release came in 5 files, with names that began with the digits 2 through 6 (perhaps because 1 was the first file released? I wasn't around then).

    The variables in the new files are as follows:
    File "2 pace_EQ5D": eq_index.0, eq_index.24, eq_index.52
    File "3 pace_HAD": HAANT.0, HAANT.24, HAANT.52, HADET.0, HADET.24, HADET.52
    File "4 pace_WSAS": WSAT.0, WSAT.24, WSAT.52
    File "5 pace_BORG": STBOR.0, STBOR.24, STBOR.52
    File "6 pace_PHQ15": PSTOT.0, PSTOT.24, PSTOT.52

    You can reconstruct those files, exactly as I received them from John, by saving just the appropriate columns from the "omnibus" file that I assembled. That is, I did not do any sorting on these files, because I knew that I had to assume that they were sorted in the same order (of participant number) as the original file.

    I can also mail the files to you, if John isn't immediately available: nicholasjlbrown@gmail.com.
     
  6. sTeamTraen

    sTeamTraen Established Member (Voting Rights)

    Messages:
    46
    OK, done that. :) I see now.


    While I was continuing to reproduce Table 3, adding the number of people who improved, I decided to see whether the change of criteria (from binary <= 3 to Likert <= 18) benefited CBT and GET. It seems that the opposite was the case: the percentage increase in the number of people either recovering or improving, when going from the binary to the Likert criterion, was higher for APT and SMC. (For improvement I used a change of 2 Likert points or 1 binary point, the latter of which I assumed would have been the criterion had they kept the binary version of the fatigue scale.)

    Code:
    
    TX    Improv-Likert    Improv-binary    %incr
    APT              99               65     52.3
    CBT             113               87     29.9
    GET             123               87     41.4
    SMC              98               62     58.1
    
    TX    Recov-Likert    Recov-binary    %incr
    APT             34              17    100.0
    CBT             60              31     93.5
    GET             51              30     70.0
    SMC             32              11    190.9
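
    As a rough check, the %incr columns above can be recomputed from the counts. This is only a sketch: it assumes %incr means the increase of the Likert count relative to the binary count, which is the reading that reproduces the figures in the tables. The function and variable names are made up for illustration.

```python
# Counts copied from the two tables above: (Likert criterion, binary criterion).
improved = {"APT": (99, 65), "CBT": (113, 87), "GET": (123, 87), "SMC": (98, 62)}
recovered = {"APT": (34, 17), "CBT": (60, 31), "GET": (51, 30), "SMC": (32, 11)}


def pct_increase(likert: int, binary: int) -> float:
    """Percentage by which the Likert count exceeds the binary count (assumed definition)."""
    return round(100 * (likert - binary) / binary, 1)


improved_incr = {arm: pct_increase(*counts) for arm, counts in improved.items()}
recovered_incr = {arm: pct_increase(*counts) for arm, counts in recovered.items()}

print(improved_incr)
print(recovered_incr)
```

    On both outcomes the relative increase is largest for SMC and APT, which is the pattern described above.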
    
    


    That raises the question, what do they mean (!) by "mean difference"? I don't see how that makes any sense unless there is some sort of pairing. You could have the mean difference across a time period within a treatment, but across treatments you don't have any pairs of data to calculate a "difference" that you could then take the mean of.

    I agree. I can make a CSV file with just the original data available, with all of the other annotations that I added when building the omnibus file, and send it to John for him to replace the original. We could maybe leave the original up, with a big fat warning.
     
  7. Alvin

    Alvin Senior Member (Voting Rights)

    Messages:
    3,309
    I say it's worth asking for.
    Who knows what the data will tell us, and if they are trying to hide it, it would be for good reason.
     
  8. Lucibee

    Lucibee Senior Member (Voting Rights)

    Messages:
    1,498
    Location:
    Mid-Wales
    You may agree, but I don't think you understand, Nick. The original (2016) data release file is fine and is available elsewhere. The version you produced by appending data to it is not. There is absolutely no point in leaving that version up, even with a big fat warning (I think @JohnTheJack has taken it down now anyway).

    This stuff gives me nightmares. I think I need to step back from this for a bit. Sorry.
     
  9. JohnTheJack

    JohnTheJack Moderator Staff Member

    Messages:
    4,789
    The original has now been taken down.

    Nick has created a new file with all the data together, but the new data is clearly marked in red as not aligned and not to be used as if it were.

    There is also a new 'readme' file that repeats the warning.

    The latest files have also been uploaded separately.

    The folder link is here.

    I am going to contact them and say I will apply to reinstate the appeal unless I have the data properly aligned.

    Thank you everyone, especially Lucy and Nick, for all your work and help.
     
  10. JohnTheJack

    JohnTheJack Moderator Staff Member

    Messages:
    4,789
  11. chrisb

    chrisb Senior Member (Voting Rights)

    Messages:
    4,602
    Is it a necessary conclusion that the data have been altered in a way which would have been known to those releasing them for QMUL? Is it not possible that they merely released the file in the form in which it was held on their database, assuming and trusting that it was in the form which had been used for the calculation, and innocently failing to check for amendments? There might have been some scrambling at an earlier date. But I know nothing of the processes which would have been undertaken, so feel free to ignore me.
     
  12. JohnTheJack

    JohnTheJack Moderator Staff Member

    Messages:
    4,789
    They should have been able to make the data correspond.
     
  13. Adrian

    Adrian Administrator Staff Member

    Messages:
    6,563
    Location:
    UK
    It would be good to push for the step test, which I think you requested, and the actual EQ-5D data rather than the UK summary scores.
     
  14. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    Would there be any chance of getting them to also include a couple of variables (or however many are needed) from the original 2016 release to act as a surrogate ID to sort on? We would then not have to trust their sorting of the table rows, but could ourselves sort both old and new against the same variables. So long as those variables combine to a unique identifier per participant, we would be OK. See my earlier post #81, though I realise it would need more than the two columns I suggested (I was in a rush) to be unique.
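
    To make the surrogate ID idea concrete, here is a minimal sketch of how two releases could be joined on shared columns, but only after checking that those columns really are unique per row. The helper names, the columns old_var and new_var, and all the values are invented for illustration; only the key column names wtms0 and wtms.52 come from this thread.

```python
from collections import Counter


def duplicate_keys(rows, key_cols):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def merge_on_surrogate_key(old_rows, new_rows, key_cols):
    """Join two releases on key_cols, refusing to guess if the key is ambiguous."""
    for name, rows in (("old", old_rows), ("new", new_rows)):
        dupes = duplicate_keys(rows, key_cols)
        if dupes:
            raise ValueError(f"{name} release: key not unique for {dupes}")
    new_by_key = {tuple(r[c] for c in key_cols): r for r in new_rows}
    old_keys = {tuple(r[c] for c in key_cols) for r in old_rows}
    if old_keys != set(new_by_key):
        raise ValueError("the two releases do not contain the same keys")
    # Row order follows the old release; the new release may be in any order.
    return [{**old, **new_by_key[tuple(old[c] for c in key_cols)]} for old in old_rows]


KEY = ["wtms0", "wtms.52"]
# Invented values: the same two participants, listed in a different order.
old_rows = [
    {"wtms0": 310, "wtms.52": 340, "old_var": "a"},
    {"wtms0": 295, "wtms.52": 300, "old_var": "b"},
]
new_rows = [
    {"wtms0": 295, "wtms.52": 300, "new_var": "y"},
    {"wtms0": 310, "wtms.52": 340, "new_var": "x"},
]
merged = merge_on_surrogate_key(old_rows, new_rows, KEY)
print(merged)
```

    The point of raising an error instead of sorting and hoping is that a single duplicated key makes every row after it potentially misaligned.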
     
    Last edited: May 13, 2019
  15. Esther12

    Esther12 Senior Member (Voting Rights)

    Messages:
    4,393
    Depending on exactly what you requested in the first place, might it be better to make a new request? (This is a question from a position of near complete ignorance, I've forgotten all the details of this request). I'm not sure how much leeway there is in how a request can be interpreted, if there should be room for requests to be clarified during the appeals process, etc. I guess there's also a danger that a repeated similar request could be viewed as 'vexatious'?

    I just thought I'd mention those things in case it's helpful to consider them.


    Yes, thanks to both.
     
  16. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    Yes. There may well be a disconnect, in terms of sort order, between the two data releases, and there may very well also be a disconnect between both releases and QMUL's data records. Sort order is not really top of the priority list when compiling data like this, and we only now have the problem because there is nothing to marry the two sets of data with.

    Ideally we would have one or both of:
    • Unique participant IDs, common to all data releases. And/or ...
    • Each released dataset being a superset of previously released datasets.
    In fact I personally would have been pleasantly surprised if the sort order had been the same between the two, given we only ever get subsets. The original data is held in a database, and this info is queried out of it, so in a way it just depends on how the queries are done I imagine, which are unlikely to be that close a match to queries for previous data releases.

    Which is why I think it would be very hazardous to rely on them giving us the same sort order, even if they genuinely attempt to do so.

    @Lucibee, @Adrian: Are there double checks that can be indirectly done to give confidence the sort orders are the same, if we get to that point?
     
    Last edited: May 13, 2019
  17. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    Another very quick thought. We may have to be prepared for the possibility that QMUL may not actually be able to guarantee a particular sort order! Someone who better understands databases will know this better, but I'm guessing that when data is queried out of a db there will be a default sort applied, which likely depends on the parameters chosen and their values. Different parameters / values might well lead to a different sort order, and the person doing the querying not knowing (or caring) how things got sorted.
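
    For what it's worth, in SQL the row order of a query result is unspecified unless the query has an ORDER BY clause, so a repeatable order has to be asked for explicitly. A minimal sketch using Python's built-in sqlite3 module follows; the table, column names and values are hypothetical, and nothing here is known about how QMUL actually stores the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per participant, keyed by an anonymous ID.
conn.execute("CREATE TABLE scores (pid INTEGER PRIMARY KEY, wsas INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(3, 20), (1, 31), (2, 27)])

# Without ORDER BY, the database is free to return rows in any order.
# With ORDER BY on a unique key, every extract comes out in the same order.
ordered = conn.execute("SELECT pid, wsas FROM scores ORDER BY pid").fetchall()
conn.close()

print(ordered)
```

    If every extract were produced with the same ORDER BY on the same unique key, the releases would line up row for row even without the key itself being published.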
     
  18. Snow Leopard

    Snow Leopard Senior Member (Voting Rights)

    Messages:
    3,860
    Location:
    Australia
    It is quite deliberate that they've made it so the data sets cannot be combined. I knew they had some sort of trick up their sleeves, I guess this is it.
     
  19. sTeamTraen

    sTeamTraen Established Member (Voting Rights)

    Messages:
    46
    I find it absolutely inconceivable that there is not, somewhere in the vaults at QMUL, a single file with all of the variables in it. That's how you analyze data from any study of any kind, unless there's some absolutely huge number of participants or variables.

    The data shouldn't be stored in a "database", in the sense of a traditional computer database management system where you have multiple tables and need keys to tie the various tables together. It will have been analyzed from a single file in the proprietary format of whichever software they used. Unusually, they report using Stata, SAS, and SPSS, which probably means they used a CSV file (a very plain, neutral file format that any statistical software package can read, including Excel) to transfer the data between the different pieces of software.

    If that's the case (and if it isn't, they have a very weird setup), then the process of sending out variables selectively ought to be easy. Start with the master file sorted by participant ID, make a copy, and in that copy, delete all the variables except the ones you want to share. If you reassemble multiple chunks that were made that way, you will get back the original records.
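
    The procedure just described can be sketched as follows, under the stated assumption that there is a single master file sorted by participant ID. The column names (pid, var_a, var_b, var_c) and values are invented for illustration.

```python
# Hypothetical master file, already sorted by participant ID.
master = [
    {"pid": 1, "var_a": 10, "var_b": 0.5, "var_c": 7},
    {"pid": 2, "var_a": 12, "var_b": 0.7, "var_c": 3},
    {"pid": 3, "var_a": 9, "var_b": 0.2, "var_c": 8},
]


def extract(rows, columns):
    """Copy only the requested columns, leaving the row order untouched."""
    return [{c: row[c] for c in columns} for row in rows]


# Two separate releases, neither of which contains the participant ID.
release_1 = extract(master, ["var_a", "var_b"])
release_2 = extract(master, ["var_c"])

# Because both releases kept the master order, pasting them side by side
# recovers the original records (minus the withheld ID).
if len(release_1) != len(release_2):
    raise ValueError("releases have different numbers of rows")
reassembled = [{**r1, **r2} for r1, r2 in zip(release_1, release_2)]

print(reassembled == extract(master, ["var_a", "var_b", "var_c"]))
```

    This only works because nothing re-sorts the rows between extracts; any sort applied to one release and not the other breaks the correspondence silently.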

    The fact that the sort order in the new files is not the same as in the first file is either down to incompetence or malice. I always prefer the first of these explanations, but the level of incompetence (probably by multiple people) required in order for the newly-released variables not to be sorted correctly makes me wonder here.

    This reminds me of a situation in another part of my life, which is a game where I have to gently ask people not to bend the rules. It's complicated, but let's say the maximum X they are allowed is 100, and they send in 115, so I say "No, please make it less than 100". They change it to 108 and I say "Please try again". They change it to 103 and I say "Please get it below 100", and then they go onto a forum and it's "The moderator is being such a nitpicker".

    If this kind of argument gets wheeled out, it will be important to make sure the judge (etc) understands that it's not nitpicking or vexatious to point out that if the request was only 98% complied with, and the 2% missing (i.e., the correct sort) means that the other 98% is useless, then the request as a whole was not complied with.
     
    Last edited: May 13, 2019
  20. Trish

    Trish Moderator Staff Member

    Messages:
    55,414
    Location:
    UK
    I haven't followed all this discussion, and I am hugely grateful to @JohnTheJack and Alem Matthees for their fantastic efforts in getting some of the data.
    Would it make sense to simply ask for all the data apart from that which could be seen as individual identifiers, such as age and location? The entire data set must surely be stored in a single spreadsheet or similar.
     
