
Data set to help with validation for ISO17025

New Member


I am currently in my final year studying computer forensics at the University of South Wales. For my final year project I am creating a data set that digital forensic labs can use in the validation part of their ISO 17025 accreditation. I have completed the research part of my project and am now focusing on the design stage, where I decide what I want my data set to validate. I am finding it hard to strike a balance: if the tests are too specific they cover things that never actually get used in a lab, which is pointless, and if they are too basic they cover things that are already quite easy to validate, which makes my data set redundant.

My supervisor (Gareth Davies) and I have come up with some lower-level ideas. For example, when a document is written in a different language, will it be extracted and processed accurately, preserving the integrity of what the document says despite the language difference?

I would really appreciate input from those who know what is done on a day-to-day basis in a lab: what do you think I could use my data set to validate?

Hope this all makes sense?


Topic starter Posted : 13/01/2020 10:52 am
Senior Member

Makes perfect sense…..since that's the entire problem with forcing ISO17025 onto Digital Forensics.

You have an astronomic number of combinations when considering the number of forensic tools, versions of those tools, forensic processes in those tools or artefacts to test for, versions of the software that generated the artefact, configurations of that software, hardware source evidence device types, source file-systems, source OSes, configurations of those OSes/file-systems, language variants of the OS/software, potentially interacting software/processes on the target machine, methods of acquisition/representation (software/hardware/storage), the myriad of combinations of all of the above, and probably many many other things, all which might impact how the data is stored, how it's been modified, or how it's captured.

Whatever it is that you pick to test, you'll only be testing an absolutely minuscule proportion of the possible configurations of source data, whatever the thing being focused on. So, whilst your language idea isn't a bad one, I'd bet most labs aren't testing for that... but they're also not going to be testing for an almost infinite number of possible ways the software (or hardware) might not perform as desired.

I realise this isn't entirely constructive, due to my dislike for ISO17025's imposition on DF, but the point was to try to illustrate that whatever your data set contains, there's going to be countless other ways/forms the data could appear/arise on a device, and be interpreted by a tool. So I would argue there's not (m)any things that are "basic" and "easy to validate" already unless the scope is narrowed to an almost meaningless extent.

Posted : 13/01/2020 1:33 pm
Community Legend

About documents (or, more generally, data) in a foreign language, particularly if a different alphabet/writing system is involved, there are two aspects:
1) the technical part of the integrity (as an example, validating that "queer" Unicode characters render/are interpreted properly)
2) the actual procedure of translation
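
The first aspect lends itself to a mechanical check. What follows is only a minimal sketch, assuming the data set ships a ground-truth text for each test document and that the tool's extracted text is available as a string. The function name missing_characters and the sample strings are hypothetical, not taken from any tool. Both strings are normalized to NFC first, so that a precomposed accented letter and its base letter plus combining accent are treated as the same character.

```python
import unicodedata
from collections import Counter

def missing_characters(expected: str, extracted: str) -> Counter:
    """Return characters (with counts) present in expected but lost in extracted.

    Both strings are normalized to NFC first, so a precomposed letter and
    its base letter + combining accent compare as equal.
    """
    exp = Counter(unicodedata.normalize("NFC", expected))
    got = Counter(unicodedata.normalize("NFC", extracted))
    return exp - got  # Counter subtraction keeps only positive counts

# Hypothetical ground truth and a hypothetical extraction that dropped accents.
expected = "società in attività"
extracted = "societ in attivit"
lost = missing_characters(expected, extracted)
```

A non-empty result means the tool lost or altered characters on the way from the source document to its output. It says nothing about the second aspect, translation, which cannot be checked this way.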

Two examples, both relatively fresh (the first one not particularly "foreign").

In Italy there is a "Chamber of Commerce" (one for each Province/City) where a number of data about companies are officially "stored". To enter into most non-trivial contracts, an attachment (mandated by Law) is needed, consisting of a certificate from the Chamber of Commerce that specifies some fiscal data, the "scope" of the company, whether it is active/not subject to liquidation, etc., with a date not earlier than (usually) 6 months.
For a number of years now you haven't had to queue every 6 months or so to get it: we are digital, modern and what not, so you can ask (for a fee) to have a (digitally signed) .pdf sent via "certified" e-mail (which is then printed and attached to the paper contract, thus losing in a single swift move all the integrity that the digital signature and certified e-mail provide).
For some *reasons* the (demented) programmers of the (useless) institution did not properly consider that in Italian a large number of words contain accented vowels, i.e. àéèìòù (which can be either lower case or CAPITAL), so this certificate is missing each and every accented vowel. Since the capital accented vowels are not directly available on the keyboard, they are sometimes replaced by the normal vowel + apostrophe, i.e. A' E' I' O' U', and these render normally; but if the text is in lower case, or the proper accented capital letter has been used …
Anyone reading the certificate will have to guess which vowel is missing (Wheel of Fortune anyone?).

Real example from a real certificate


A few years ago in the investigations for the murder of a poor girl (only as a quick reference)

the first suspect arrested was a young man from Morocco who, according to the Police translation (three different translators, I have to presume all of them qualified), had said in Arabic on the phone something *like*
"May Allah forgive me, but I didn't kill her."

Later the same phone call recording was analyzed by another 7 (seven) translators/interpreters, of which 4 declared they couldn't translate it at all, and 3 gave completely different meanings, including
"Allah, make him answer"

In the end, after a total of 16 (sixteen) different translators/interpreters gave their own translation it was finally determined that what was said was more *like*
"Allah, do facilitate my travel (back to Morocco)"
"Allah, help us get to the ship (to Morocco)".

This last example is obviously extreme, and it is not about written data, still …


Posted : 13/01/2020 3:27 pm
Active Member

I'm already thinking this post seems like groundhog day from a similar post not that long ago…..

So I would say that creating a data set to validate a range of techniques is beyond what can be done in a final year timeline.
With that in mind, why not pick a specific process and map out the requirements and data set for that purpose? As you have already mentioned foreign languages and keywords, I will give some ideas of how you could do it for that purpose.

So with keyword searching, if we take a basic requirement of "must find specified text" we can then start mapping out the other requirements based on this.
So must find text encoded as x (UTF-8, ASCII, UTF-16)
Must find text in document (doc, docx, pdf)
Must find text in archive/compressed files (zip, hiberfil.sys…..)
Must be able to search for REGEX (GREP) expressions.
Must be able to find text in pictures (Using OCR or similar)
Need to look into different file systems and OS's and decide if these have an effect on anything as well and whether they need to be a variable on the tests.
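
As a rough illustration of the encoding requirement above, here is a minimal sketch of how test vectors with known results could be generated. Everything in it is an assumption for illustration: the keyword, the filler bytes, and the search function, which is only a stand-in for whatever tool is being validated.

```python
import re

KEYWORD = "società"  # hypothetical search term with a non-ASCII character

# Encodings taken from the requirement list; ASCII cannot represent "à",
# so it is skipped for this keyword.
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "ascii"]

def build_vectors(keyword: str) -> dict:
    """Return {encoding: bytes} test blobs, each containing the keyword once."""
    vectors = {}
    for enc in ENCODINGS:
        try:
            payload = keyword.encode(enc)
        except UnicodeEncodeError:
            continue  # keyword not representable in this encoding
        # Non-zero filler so the hit sits at a known offset (16) and the
        # padding cannot be mistaken for part of a UTF-16 code unit.
        vectors[enc] = b"\xaa" * 16 + payload + b"\xaa" * 16
    return vectors

def search(blob: bytes, keyword: str, encodings) -> list:
    """Stand-in for the tool under test: return (encoding, offset) hits."""
    hits = []
    for enc in encodings:
        try:
            needle = keyword.encode(enc)
        except UnicodeEncodeError:
            continue
        hits += [(enc, m.start()) for m in re.finditer(re.escape(needle), blob)]
    return hits

vectors = build_vectors(KEYWORD)
# A search that only knows UTF-8 finds the keyword in one vector only.
utf8_only = {enc: search(blob, KEYWORD, ["utf-8"]) for enc, blob in vectors.items()}
```

The point of the design is that each vector has exactly one expected hit at a known offset, so a tool's result can be compared against ground truth rather than judged by eye. Documents, archives and OCR would each need their own vectors built along the same lines.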

From here, you can look at creating data sets that meet one or more criteria with known results.
I would suggest looking at what data is already out there on various websites and then filling the gaps with ones you need.

I hope this makes sense.

Posted : 13/01/2020 4:23 pm
Community Legend

I would really appreciate input from those who know what is done on a day-to-day basis in a lab: what do you think I could use my data set to validate?

TL/DR Forget ISO 17025 and such; focus on important validation areas.

I think you're mixing up two things: ISO 17025 and tool validation. For ISO 17025 I doubt that you can create any kind of validation data: you basically perform a revision to check if a realization of it works as expected.

While a particular realization of ISO 17025 may refer to particular tools (X, Y and Z, say), there's nothing that requires those tools to be used by all such realizations. Basically, you appear to conflate ISO 17025 validation and tool validation.

Additionally, while tools may be validated on general grounds (for example, verifying that a particular tool really does interpret images of ISO 9660 standard – something no tool I know of is able to do completely), it is tool usage under the organization's SOP that really matters. If tool X provides RAID restoration, but the lab actually uses tool Y for that, tool X validation of that particular aspect is fairly uninteresting. (It is interesting as part of tool evaluation prior to tool adoption, but that seems to be all.)

That is, you have more of a smörgåsbord situation: many tests that a particular lab selects from and combines to make up *their* particular validation test suite.

Such validation suites need to be created by experts on the validation subject. I leave to you and your supervisor to evaluate that particular issue.

There are several general areas that badly need validation suites in order to ensure tools provide correct information as input to a court of justice. These areas should be identified. (I must admit a bias here – I'm dabbling in this field myself – but I'll try to avoid personal interests.)

Many or all aspects of system time is one example: does the tool correctly identify how system time is established, and how well system time is maintained during operation? Is the tool capable of interpreting time stamps correctly and with the necessary precision and accuracy? Are local-time time stamps in logs identified as such, or are they mistaken for UT or UTC time stamps? Does the tool handle time zones correctly? (And perhaps also: does the tool correctly handle changes in daylight saving definitions? That is, not just the usual changes twice a year, but changes of what time zone or DST rules are to be observed from one year to the next? A SOP may need to specify that someone should keep track of whether such changes affect lab work.)
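
To make the local-time versus UTC question concrete, here is a small sketch. The timestamp text and the UTC+1 offset are assumptions chosen for illustration; a real test case would document the zone the source system actually used.

```python
from datetime import datetime, timedelta, timezone

# A log line with no zone information (hypothetical test value).
naive = datetime.strptime("2020-01-13 10:52:00", "%Y-%m-%d %H:%M:%S")

# Two candidate interpretations of the same text.
as_utc = naive.replace(tzinfo=timezone.utc)
as_local = naive.replace(tzinfo=timezone(timedelta(hours=1)))  # assumed UTC+1 source

# The two instants differ by exactly the zone offset.
difference = as_utc - as_local
```

Here the same text maps to two instants one hour apart. A test case would record which interpretation is correct for the source system and check that the tool reports that one, or at least flags that the zone is unknown.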

Some aspects of file system interpretation are also important: does the tool interpret data and metadata in file system F correctly? This is a huge job: file system entity time stamps is one part, and already covered, but it needs to be done for every important file system. And you probably need the tool to flag any file systems it shouldn't be used for, according to the SOP. If tool E cannot do UDF reliably, a user error should not allow it to be used by mistake.

This, obviously, extends to archive file formats like ZIP, RAR, ZOO and so on. Example: very, very few tools that claim to read gzip archives (as specified by RFC 1952) actually do so correctly. And ZIP can be extended – is the presence of such extensions (especially unknown ones) reported as expected?
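
The post does not say which parts of RFC 1952 tools get wrong. As one illustration, chosen here and not taken from the post, the RFC allows a gzip file to consist of several members concatenated together. The sketch below builds such a test vector and contrasts a full read with a stand-in reader that stops after the first member.

```python
import gzip
import zlib

# Build a test vector: two gzip members concatenated into one stream.
member_a = gzip.compress(b"first member\n")
member_b = gzip.compress(b"second member\n")
vector = member_a + member_b

# Reference result: Python's gzip module reads all members.
expected = gzip.decompress(vector)

# Stand-in for a reader that stops after the first member.
one_member = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
partial = one_member.decompress(vector)
leftover = one_member.unused_data  # bytes of the second member, left unread
```

A tool that returns only the partial content, without reporting that data was left unread, would silently hide the second member from the examiner.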

Some file systems even provide for specifying the code table used for file data: they can tell you if the content of a file uses Russian or Greek or other character sets. (ISO 9660 is one of them.) Does your tool tell you when that situation is at hand? Do its search functions adapt to this?

(I am tempted to add Windows registry interpretation to file system interpretation: there are many parallels.)

Test data is only one aspect: test instructions are probably just as important. While final instruction rests with the lab, general instructions may point out areas where there is or may be a risk of bad interpretation of data. (One such area is the tendency of some software to suppress trailing parts of time stamps that are zero, e.g. 01:00:00 becomes 01; or 2020-01-01 00:00:00 becomes 2020-01-01, and so looks like a date rather than a time – a risk for misinterpretation. Similarly, some software reports bad or uninterpretable time stamps as blank: whitespace can be difficult to interpret without special training.)
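
The trailing-zero risk can be turned into a simple check on what a tool displays. This is a sketch only: the expected display format and the helper name keeps_time_part are assumptions, and a lab would substitute whatever format its SOP prescribes.

```python
import re
from datetime import datetime

# Assumed display format: date, space, then hours, minutes and seconds.
FULL_TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")

def keeps_time_part(rendered: str) -> bool:
    """True if the rendered value still shows hours, minutes and seconds."""
    return bool(FULL_TS.match(rendered))

midnight = datetime(2020, 1, 1, 0, 0, 0)
full = midnight.isoformat(sep=" ")         # time part kept
suppressed = midnight.date().isoformat()   # time part dropped, looks like a date
```

A test value that falls exactly on midnight is the useful one here, because that is where a suppressed time part becomes indistinguishable from a plain date.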

Getting a kind of high-altitude map of such considerations would help: what areas are important enough to require a high degree of data quality in general (time is probably one of them; file metadata another), and what areas are important only in special cases (such as varying time zone definitions).

Good luck!

(Added: You asked this, or at least a very similar question, last year as well. I expect most of the answers from then apply now as well – mine certainly do …)

Posted : 14/01/2020 8:20 am
Community Legend

(I am tempted to add Windows registry interpretation to file system interpretation: there are many parallels.)

Sure, they actually are a "same" thing, JFYI




Posted : 14/01/2020 9:31 am