digitalmars.D - Feature Request: Hashed Based Assertion

tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:

I brought this topic up in "Learn" a while ago, but I want to talk 
about it again.

You are on a big team or working with a big code base. APIs, 
configuration constants, and data structures are being defined 
and modified all the time.

You are coding on the business logic side, relying entirely on 
the current APIs, configuration, and data structures. Part of the 
code has been updated on the API side, but you are not aware of 
it, or time has passed and you simply assume that your code will 
still work properly. Nobody is going to check every single part 
of the business logic line by line.

At runtime, you will get unexpected results, and lose some hair 
before finding where the problem is. Tracking down unexpected 
results in long-running processes causes even more trouble.

---

What I currently do is this: I calculate the hash of the API code 
(functions, configuration, etc. together) with a hash function, 
and store it as a constant where the API is defined.

public enum HASH_OF_THIS_API = 0x1234;

// Hash is calculated from here
public void my_api_function(){}

public enum my_api_constant = 5;
// till here

Then, wherever I use that API, I insert a "static assert( 
HASH_OF_THIS_API == 0x1234 );".

Whoever modifies the API recalculates the hash of the updated 
code afterwards and updates the constant. This lets the compiler 
warn the business logic programmer about changes in the API 
code, so the affected parts can be reviewed and changed if 
required.

---

Here comes the feature request: the API programmer may forget to 
update the hash value in the code. Also, comments in the code 
shouldn't affect the hash value. This needs to be automated at 
compile time: the compiler calculates the hash value of the code 
automatically and makes it readable at compile time. Hence, no 
constant is required to store the hash value.

What is needed is a way to bind a hash value to any named block.
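
As a rough illustration of the workflow described above (not an existing compiler feature), part of the manual step can already be handed to the compiler through CTFE. The sketch assumes a hypothetical helper `fnv1a` and a hypothetical `apiSource` token string; neither comes from Phobos or from the post, and FNV-1a merely stands in for whatever hash function is actually used. Because the token string is hashed verbatim, comments inside it would still change the hash.

```d
// Sketch only: fnv1a, apiSource and this module layout are hypothetical.

/// 32-bit FNV-1a; simple enough to be evaluated at compile time (CTFE).
uint fnv1a(string s) pure nothrow @nogc @safe
{
    uint h = 2166136261u;          // FNV offset basis
    foreach (char c; s)
    {
        h ^= c;
        h *= 16777619u;            // FNV prime, wraps modulo 2^32
    }
    return h;
}

// Proves that the helper really runs at compile time.
static assert(fnv1a("") == 0x811c9dc5);

// The API is written once as a token string, then both compiled and hashed.
enum apiSource = q{
    public void my_api_function() {}
    public enum my_api_constant = 5;
};
mixin(apiSource);

// Computed by the compiler, so nobody can forget to update it.
enum HASH_OF_THIS_API = fnv1a(apiSource);

void main()
{
    import std.stdio : writefln;

    // Print the value once, then paste it into the client-side static assert.
    writefln("0x%08x", HASH_OF_THIS_API);
}
```

A client module would then compare `HASH_OF_THIS_API` against the printed value in its own `static assert`, exactly as in the manual scheme; only the bookkeeping on the API side goes away.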

Nov 26 2015

tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:

On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to talk 
 about it again.

 [...]

One possible solution: __traits( hashOf, 
apiFunctionName/structName/variableName/className )

Nov 26 2015

Andrea Fontana <nospam example.com> writes:

On Thursday, 26 November 2015 at 11:14:54 UTC, tcak wrote:
 On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to 
 talk about it again.

 [...]

 One applicable solution: __traits( hashOf, 
 apiFunctionName/structName/variableName/className )

Can't you calculate the hash of the files involved at compile time?

Nov 26 2015

tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:

On Thursday, 26 November 2015 at 11:18:19 UTC, Andrea Fontana 
wrote:
 On Thursday, 26 November 2015 at 11:14:54 UTC, tcak wrote:
 On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to 
 talk about it again.

 [...]

 One applicable solution: __traits( hashOf, 
 apiFunctionName/structName/variableName/className )

 Can't you calculate hash of involved files at compile time?

One file can contain many API functions. If there are 50 
functions in it and only one of them has been modified, the whole 
hash will change, and the compiler cannot tell which API has 
changed. The purpose is to take the burden off the programmer and 
put it on the compiler.

Nov 26 2015

Jacob Carlborg <doob me.com> writes:

On 2015-11-26 12:24, tcak wrote:

 One file can consist of many API functions. If there are 50 functions in
 it, and only 1 of them has been modified, whole hash will change.
 Compiler cannot tell which API has been changed then. Purpose is to
 decrease the burden on programmer, and put it onto compiler.

With a complete D front end working at compile time it would at least be 
possible in theory.

-- 
/Jacob Carlborg

Nov 26 2015

qznc <qznc web.de> writes:

On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to talk 
 about it again.

 You are in a big team or working with a big code base. APIs are 
 being defined/modified, configuration constants are 
 defined/modified, structures are defined/modified for data.

 You are coding on business logic side, and relying everything 
 based on current APIs, configuration, and data structures. A 
 part of codes have been updated on API side, but you are not 
 aware of it, or time has passed, and you assume that your code 
 will work properly. Nobody would be checking every single part 
 of business logic line by line.

This is the job of the type checker, isn't it? What would a hash 
provide that a type checker does not?

Nov 26 2015

Idan Arye <GenericNPC gmail.com> writes:

On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 I brought this topic in "Learn" a while ago, but I want to talk 
 about it again.

 [...]

So it's not just the function's signature you want to hash, but 
its code as well? What about functions called from the API 
function? Or functions that set data that'll later be used by the 
API functions?

If anything, I would have hashed the unittests of the API 
function. If the behavior of the API function changes in a 
fashion that requires a modification of the unittest, then you 
might need to alert the business logic programmers. Anything less 
than that is just useless noise that'll hide the actual changes 
you want to be warned about among the endless clutter created by 
trivial changes.

Nov 26 2015

bitwise <bitwise.pvt gmail.com> writes:

On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 [...]

I'm wondering if a diff tool could somehow be combined with a 
parser to create a list of functions/symbols which may have 
experienced behavioural changes between versions of dmd. What I'm 
suggesting is a diff tool which is aware of a symbol's 
dependencies, so that even if a function body wasn't changed, its 
dependent symbols could be checked as well.

If such a tool existed, it could be run against each new release 
of dmd and produce a comma-separated list of functions that may 
have experienced behavioural changes. With that list in hand, one 
could then simply grep for each symbol in their own repository 
each time they upgrade dmd.

I hereby place this idea in the public domain ;)

    Bit

Nov 26 2015

deadalnix <deadalnix gmail.com> writes:

I see many solutions here that do not require any language change. 
To start, have a linter yell at the programmer when (s)he submits 
a diff. Devs commit directly? What the fuck are you doing? Do 
code review and get a linter.

Alternatively, generate a di file and hash it. You can have a bot 
do it and commit with a commit hook.

DMD can dump information about the program in JSON format. Hash 
this and run with it.

You may also change your strategy in terms of source control: 
https://www.youtube.com/watch?v=W71BTkUbdqE . Unified source code 
completely alleviates these kinds of issues, to boot.

Nov 26 2015

tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:

On Friday, 27 November 2015 at 05:33:52 UTC, deadalnix wrote:
 I see many solution here that do not require any language 
 change. To start, have a linter yell at the programmer when 
 (s)he submit a diff. Dev commit directly ? What the fuck are 
 you doing ? Do code review and get a linter.

 Alternatively, generate a di file and hash it. You can have a 
 bot do it and commit with a commit hook.

 DMD can dump infos about the program in json format. hash this 
 and run with it.

 You may also change your strategy in term of source control: 
 https://www.youtube.com/watch?v=W71BTkUbdqE . Unified source 
 code aleviate completely these kind of issues to boot.

None of your solutions offers anything as simple as:

static assert( __traits( hashOf, std.file.read ) == 0x1234, "They 
have changed implementation again." );

static assert( __traits( hashOf, facebook.apis.addUser ) == 
0x5543, "Check API documentation again for addUser." );



A di file wouldn't work: it doesn't contain implementation code. 
Also, all the APIs are in it. We need a specific hash for each 
API, so that it doesn't take long to find where the problem is.

JSON is the same as di. No difference.


Yours are not helping; they make everything more complex.

Nov 27 2015

bitwise <bitwise.pvt gmail.com> writes:

On Friday, 27 November 2015 at 08:09:27 UTC, tcak wrote:
 Yours are not helping, making everything more complex.

Yes, because to achieve what you're asking for, you NEED a 
complex solution.

The code WILL change with every release; that's the point of a 
release. So any hashing mechanism like the one you're describing 
will just trigger every time, making it useless. Even if this were 
not the case, you still wouldn't know where the changes were.

     Bit

Nov 27 2015

tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:

On Friday, 27 November 2015 at 16:18:52 UTC, bitwise wrote:
 On Friday, 27 November 2015 at 08:09:27 UTC, tcak wrote:
 Yours are not helping, making everything more complex.

 Yes, because to achieve what you're asking for, you NEED a 
 complex solution.

 The code WILL change with every release..thats the point of a 
 release.. so any hashing mechanism like you're describing will 
 just trigger every time, making it useless. Even if this was 
 not the case, you still wouldn't know where the changes were.

     Bit

Let me explain:

It is not complex. What makes it complex is that you envision a 
very detailed thing.

Hash of a Function = MD5( Token List of Function /* but ignore 
comments */ );
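
A minimal sketch of what such a token hash could look like, written as an ordinary library function rather than a compiler trait. The names `tokenize` and `tokenHash` are hypothetical, and the lexer is deliberately naive: it assumes ASCII source, does not understand string literals or nested `/+ +/` comments, and treats every punctuation character as its own token.

```d
import std.array : join;
import std.ascii : isAlphaNum, isWhite;
import std.digest : toHexString;
import std.digest.md : md5Of;

/// Splits source text into a crude token list, dropping whitespace
/// as well as // and /* */ comments.
string[] tokenize(string src) pure @safe
{
    string[] toks;
    size_t i = 0;
    while (i < src.length)
    {
        immutable c = src[i];
        immutable next = i + 1 < src.length ? src[i + 1] : '\0';
        if (isWhite(c))
        {
            ++i;
        }
        else if (c == '/' && next == '/')
        {
            // Line comment: skip to the end of the line.
            while (i < src.length && src[i] != '\n') ++i;
        }
        else if (c == '/' && next == '*')
        {
            // Block comment: skip to the closing */ (or to end of input).
            i += 2;
            while (i + 1 < src.length && !(src[i] == '*' && src[i + 1] == '/')) ++i;
            i = i + 2 <= src.length ? i + 2 : src.length;
        }
        else if (isAlphaNum(c) || c == '_')
        {
            // Identifier, keyword or number: take the whole run.
            immutable start = i;
            while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_')) ++i;
            toks ~= src[start .. i];
        }
        else
        {
            // Any other character is treated as a one-character token.
            toks ~= src[i .. i + 1];
            ++i;
        }
    }
    return toks;
}

/// MD5 over the token list, returned as a hex string.
string tokenHash(string src)
{
    auto hex = toHexString(md5Of(tokenize(src).join(" ")));
    return hex[].idup;
}

void main()
{
    immutable a = tokenHash("if(1) doSomething();");
    immutable b = tokenHash("if (1)  doSomething( ); // reformatted, with a comment");
    immutable c = tokenHash("if(1) { doSomething(); }");

    assert(a == b); // layout and comments do not affect the hash
    assert(a != c); // added braces change the token list, so the hash differs
}
```

The hash is stable under reformatting and comment edits, but any change to the token list alters it, whether or not the behaviour changes.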


You do not have to know where the changes are. You need to know 
what has changed and, briefly, how it behaves now.


If the behaviour of code changes, it is good to know about it. 
With the hashing method above, a piece of code that hasn't 
changed would always have the same hash value. And if you do not 
like it, don't check the hash value; just continue writing your 
code as you wish. But from a business perspective, if the 
software's consistency is worth millions of dollars, a software 
engineer would want it to give an error whenever the code 
changes. Do we want D to be a child language, or to have more 
useful features?

Nov 27 2015

bitwise <bitwise.pvt gmail.com> writes:

On Friday, 27 November 2015 at 18:51:54 UTC, tcak wrote:
 [...]

 Let me explain:

 It is not complex. What makes it complex is that you envision a 
 very detailed thing.

 Hash of a Function = MD5( Token List of Function /* but ignore 
 comments */ );


 You do not have to know where the changes are. You need to know 
 what has changed,
 how it acts currently briefly.


 If behaviour of code changes, it is good that you know it. With 
 above hashing method, a piece of code that hasn't changed would 
 have same hash value always. And
 if you do not like it, don't check the hash value. Just 
 continue writing your codes as you wish. But in business 
 perspective, if the software's consistency is worth millions of 
 dollars, a software engineer would want it to be giving error 
 whenever
 codes change. Do we want D to be a child language, or have more 
 useful features?

Your approach is prone to false positives.

if(1) doSomething();
if(1) { doSomething(); }

Same behaviour, different code.
I hope you have a heck of a coding standard written up ;)

Worse still, consider the following example:

void foo() { if(bar()) deleteSomeFiles(); }
int bar() { return 0; }

Your proposed approach would not notify you that foo(), a 
potentially dangerous function, has changed its behaviour if 
someone made bar() return 1.

*insert witty comeback to your comment about "business 
perspective" here*

     Bit

Nov 27 2015

tcak <1ltkrs+3wyh1ow7kzn1k sharklasers.com> writes:

On Friday, 27 November 2015 at 20:00:16 UTC, bitwise wrote:
 [...]

 Your approach is prone to false positives.

 if(1) doSomething();
 if(1) { doSomething(); }

 Same behaviour, different code.
 I hope you have a heck of a coding standard written up ;)

 Worse still, consider the following example:

 void foo() { if(bar()) deleteSomeFiles(); }
 int bar() { return 0; }

 Your proposed approach would not notify you that foo(), a 
 potentially dangerous function, has changed it's behaviour if 
 someone made bar() return 1.

 *insert witty comeback to your comment about "business 
 perspective" here*

     Bit

Question: Has the behaviour of foo changed?

If foo cares about bar's behaviour, foo checks bar's hash value.

--

if(1) doSomething();
if(1) { doSomething(); }

You are correct here about the hash calculation, but unless 
someone touches the code, this never happens, and no hash change 
would be seen. If someone does touch it as in your example, 
checking the documentation to see what has happened would be the 
correct approach. How important a behaviour change is remains a 
matter of perception; the computer cannot know that in advance.

Nov 27 2015

qznc <qznc web.de> writes:

On Friday, 27 November 2015 at 20:19:40 UTC, tcak wrote:
 if(1) doSomething();
 if(1) { doSomething(); }

 You are correct here about hash calculation, but unless someone 
 touches to codes, this never happens, and no hash changes would 
 be seen. If someone is touching it as you exampled, checking 
 the documentation about what has happened would be the correct 
 approach. Importance of behaviour change is perceptional, 
 computer cannot know that already.

If you really want to integrate this into the language, you 
should consider future improvements.

Hashing the tokens is a conservative approximation of "behavior 
change", as the example above shows. Another example would be 
variable renames. The specification of the hash algorithm should 
provide the freedom that both variants above get the same hash, 
but still be correct in the sense that different behavior always 
yields different hashes.

Overall, I'm not convinced that this needs to be a language 
extension or trait. It could simply be a static analysis tool 
independent of the compiler.

Nov 27 2015

deadalnix <deadalnix gmail.com> writes:

On Friday, 27 November 2015 at 08:09:27 UTC, tcak wrote:
 [...]

 Not one thing in your solutions give any simple solution like:

 static assert( __traits( hashOf, std.file.read ) == 0x1234, 
 "They have changed implementation again." );

 static assert( __traits( hashOf, facebook.apis.addUser ) == 
 0x5543, "Check API documentation again for addUser." );



 di file wouldn't work. It doesn't contain implementation code. 
 Also, all APIs are in it. We need specific hash for each API, 
 so it doesn't take long time to find where the problem is.

 JSON is same as di. No difference.


 Yours are not helping, making everything more complex.

If the API signature changes, the type system will yell at you. 
All the proposed solutions will work.

If the implementation changes, you can apply the same solution to 
the binary, tadaaa! If you want fewer hash changes, a good idea 
is to dump LLVM IR from ldc and run canonicalization on it 
using opt.

Also, if you have so much code relying on implementation details 
that aren't in the API that you need a language extension to 
handle it, you are doing something very, very wrong.

Indeed, I'm not helping. You think you need a language extension, 
when it is quite obvious that you have a methodology problem on 
your side and refuse to reconsider.

What about, I know it is crazy, using a unified repository, having 
tests and continuous integration, and submitting diffs with code 
review? If someone changes an API in a way that breaks the client 
code, the client will fail and the CI tool will warn the developer 
that he needs to fix the client code or rework his API change. If 
the client code was not tested, then the problem is clearly not 
the API hash.

Not only does this not require a language extension, it also 
solves way more problems than the one you want to solve here.

Now, don't get me wrong, I know how it is. Companies with a broken 
work culture won't change anything unless they are on the edge 
of bankruptcy. I understand. This is how it works.

Please understand that, on the other side, it doesn't seem like 
the right move to export a broken work environment as language 
features.

Nov 27 2015

Nordlöw <per.nordlow gmail.com> writes:

On Thursday, 26 November 2015 at 11:12:07 UTC, tcak wrote:
 What is needed is to be able to bind a hash value to any block 
 with a name.

I've thought about this too in the past and asked on the forums, 
but I haven't gotten any response.

It is possible. The problem is easier in dynamic languages. See, 
for instance, the following solution in a specific Python 
runtime: http://pgbovine.net/incpy.html

`hashOf` is for AAs, not for content digests.

I believe the only realistic solution to this problem is to 
implement a specific pass in the D compiler that recursively 
calculates hash-digests (hash-chains) for all the code and data 
involved in a function call. It should probably only work for 
pure functions. AFAICT, it is possible but it's far from easy to 
get 100% correct :)
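
To make the recursive idea concrete, here is a small sketch that runs outside the compiler on a hand-written call graph. Everything in it is an assumption made for illustration: the `Func` record, the `fnv1a` and `chainedHash` helpers, and the premise that the call graph is known and acyclic. A real compiler pass would derive the graph itself and would also have to cover data, templates and indirect calls.

```d
// Sketch only: Func, fnv1a and chainedHash are hypothetical names.

struct Func
{
    string text;       // source text of the function
    string[] callees;  // names of the functions it calls directly
}

/// 32-bit FNV-1a over the bytes of a string.
uint fnv1a(string s) pure nothrow @nogc @safe
{
    uint h = 2166136261u;
    foreach (char c; s)
    {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

/// Hash of a function's own text, chained with the hashes of everything
/// it calls. Assumes the call graph contains no cycles.
uint chainedHash(Func[string] funcs, string name)
{
    auto f = funcs[name];
    uint h = fnv1a(f.text);
    foreach (callee; f.callees)
        h = (h ^ chainedHash(funcs, callee)) * 16777619u;
    return h;
}

void main()
{
    Func[string] funcs = [
        "foo": Func("void foo() { if (bar()) deleteSomeFiles(); }", ["bar"]),
        "bar": Func("int bar() { return 0; }", null)
    ];

    immutable fooOwn  = fnv1a(funcs["foo"].text);
    immutable fooDeep = chainedHash(funcs, "foo");

    funcs["bar"].text = "int bar() { return 1; }"; // only bar changes

    assert(fnv1a(funcs["foo"].text) == fooOwn);    // per-function hash: unchanged
    assert(chainedHash(funcs, "foo") != fooDeep);  // chained hash: change detected
}
```

The foo()/bar() pair is the example raised earlier in the thread: hashing foo() alone misses the change in bar(), while the chained hash picks it up.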

DMD pull requests would be very welcome, at least by me ;)

See also: https://en.wikipedia.org/wiki/Hash_chain

Nov 27 2015
