
digitalmars.D - Developing Mars lander software

reply Walter Bright <newshound2 digitalmars.com> writes:
http://cacm.acm.org/magazines/2014/2/171689-mars-code/fulltext

Some interesting tidbits:

"We later revised it to require that the flight software as a whole, and each 
module within it, had to reach a minimal assertion density of 2%. There is 
compelling evidence that higher assertion densities correlate with lower 
residual defect densities."

This has been my experience with asserts, too.

"A failing assertion is now tied in with the fault-protection system and by 
default places the spacecraft into a predefined safe state where the cause of 
the failure can be diagnosed carefully before normal operation is resumed."

Nice to see confirmation of that.

"Running the same landing software on two CPUs in parallel offers little 
protection against software defects. Two different versions of the 
entry-descent-and-landing code were therefore developed, with the version 
running on the backup CPU a simplified version of the primary version 
running on the main CPU. In the case where the main CPU would have 
unexpectedly failed during the landing sequence, the backup CPU was 
programmed to take control and continue the sequence following the 
simplified procedure."

An example of using dual systems for reliability.
Feb 18 2014
next sibling parent "Craig Dillabaugh" <cdillaba cg.scs.carleton.ca> writes:
On Tuesday, 18 February 2014 at 23:05:21 UTC, Walter Bright wrote:
 http://cacm.acm.org/magazines/2014/2/171689-mars-code/fulltext

 Some interesting tidbits:

 "We later revised it to require that the flight software as a 
 whole, and each module within it, had to reach a minimal 
 assertion density of 2%. There is compelling evidence that 
 higher assertion densities correlate with lower residual defect 
 densities."

 This has been my experience with asserts, too.

 "A failing assertion is now tied in with the fault-protection 
 system and by default places the spacecraft into a predefined 
 safe state where the cause of the failure can be diagnosed 
 carefully before normal operation is resumed."

 Nice to see confirmation of that.

 "Running the same landing software on two CPUs in parallel 
 offers little protection against software defects. Two 
 different versions of the entry-descent-and-landing code were 
 therefore developed, with the version running on the backup CPU 
 a simplified version of the primary version running on the main 
 CPU. In the case where the main CPU would have unexpectedly 
 failed during the landing sequence, the backup CPU was 
 programmed to take control and continue the sequence following 
 the simplified procedure."

 An example of using dual systems for reliability.
I thought you were going to tell us it had been developed using D. It only seems right that a Mars lander would use Digital Mars software. Plus, isn't D safer than C99? Maybe when they send the manned mission to Mars they will do the right thing :o)
Feb 18 2014
prev sibling next sibling parent reply "Tolga Cakiroglu" <tcak pcak.com> writes:
On Tuesday, 18 February 2014 at 23:05:21 UTC, Walter Bright wrote:
 http://cacm.acm.org/magazines/2014/2/171689-mars-code/fulltext

 Some interesting tidbits:

 "We later revised it to require that the flight software as a 
 whole, and each module within it, had to reach a minimal 
 assertion density of 2%. There is compelling evidence that 
 higher assertion densities correlate with lower residual defect 
 densities."

 This has been my experience with asserts, too.

 "A failing assertion is now tied in with the fault-protection 
 system and by default places the spacecraft into a predefined 
 safe state where the cause of the failure can be diagnosed 
 carefully before normal operation is resumed."

 Nice to see confirmation of that.

 "Running the same landing software on two CPUs in parallel 
 offers little protection against software defects. Two 
 different versions of the entry-descent-and-landing code were 
 therefore developed, with the version running on the backup CPU 
 a simplified version of the primary version running on the main 
 CPU. In the case where the main CPU would have unexpectedly 
 failed during the landing sequence, the backup CPU was 
 programmed to take control and continue the sequence following 
 the simplified procedure."

 An example of using dual systems for reliability.
I only skimmed the link, but how do they detect that a CPU has failed? Some information must be passed outside the CPU to do this. The only solution that comes to my mind is that the main CPU updates a variable in external memory at every step, and the backup CPU checks it continuously to catch a failure immediately. But this would already require about 50% of the CPU's power. While thinking about this kind of backup system, it is great to read that some people are really building them.
Feb 18 2014
parent reply "Xinok" <xinok live.com> writes:
On Wednesday, 19 February 2014 at 00:16:03 UTC, Tolga Cakiroglu 
wrote:
 I only skimmed the link, but how do they detect that a CPU has 
 failed? Some information must be passed outside the CPU to do 
 this. The only solution that comes to my mind is that the main 
 CPU updates a variable in external memory at every step, and 
 the backup CPU checks it continuously to catch a failure 
 immediately. But this would already require about 50% of the 
 CPU's power.

 While thinking about this kind of backup system, it is great to 
 read that some people are really building them.
I'm assuming this has something to do with it: https://en.wikipedia.org/wiki/Heartbeat_%28computing%29 In clustered servers, the active node sends a continuous signal indicating it's still alive. This signal is referred to as a heartbeat. There's a standby node waiting to take over should it stop receiving this signal.
Feb 18 2014
parent reply "Tolga Cakiroglu" <tcak pcak.com> writes:
On Wednesday, 19 February 2014 at 01:09:43 UTC, Xinok wrote:
 On Wednesday, 19 February 2014 at 00:16:03 UTC, Tolga Cakiroglu 
 wrote:
 I only skimmed the link, but how do they detect that a CPU has 
 failed? Some information must be passed outside the CPU to do 
 this. The only solution that comes to my mind is that the main 
 CPU updates a variable in external memory at every step, and 
 the backup CPU checks it continuously to catch a failure 
 immediately. But this would already require about 50% of the 
 CPU's power.

 While thinking about this kind of backup system, it is great to 
 read that some people are really building them.
I'm assuming this has something to do with it: https://en.wikipedia.org/wiki/Heartbeat_%28computing%29 In clustered servers, the active node sends a continuous signal indicating it's still alive. This signal is referred to as a heartbeat. There's a standby node waiting to take over should it stop receiving this signal.
I think knowing only that it has failed is not enough. Because the process is a landing, the other CPU needs to know where the process left off. With that heartbeat signal, the only option is that all sensor information is sent to both CPUs continuously, and the sensor values must be enough to determine the next step to take. Then I think it can continue the process flawlessly.
Feb 18 2014
next sibling parent "Xinok" <xinok live.com> writes:
On Wednesday, 19 February 2014 at 05:53:55 UTC, Tolga Cakiroglu 
wrote:
 On Wednesday, 19 February 2014 at 01:09:43 UTC, Xinok wrote:
 On Wednesday, 19 February 2014 at 00:16:03 UTC, Tolga 
 Cakiroglu wrote:
 I only skimmed the link, but how do they detect that a CPU 
 has failed? Some information must be passed outside the CPU 
 to do this. The only solution that comes to my mind is that 
 the main CPU updates a variable in external memory at every 
 step, and the backup CPU checks it continuously to catch a 
 failure immediately. But this would already require about 
 50% of the CPU's power.

 While thinking about this kind of backup system, it is great 
 to read that some people are really building them.
I'm assuming this has something to do with it: https://en.wikipedia.org/wiki/Heartbeat_%28computing%29 In clustered servers, the active node sends a continuous signal indicating it's still alive. This signal is referred to as a heartbeat. There's a standby node waiting to take over should it stop receiving this signal.
I think knowing only that it has failed is not enough. Because the process is a landing, the other CPU needs to know where the process left off. With that heartbeat signal, the only option is that all sensor information is sent to both CPUs continuously, and the sensor values must be enough to determine the next step to take. Then I think it can continue the process flawlessly.
I don't have experience with, or much knowledge of, these kinds of systems; I'm merely aware of the concepts. The process of one system taking over when another system fails is called failover [1]. Depending on the requirements, the system could be designed so the standby node continues from the last successful state of the failed node [2]. To quote the page on Wikipedia [2], "Most importantly, the application must store as much of its state on non-volatile shared storage as possible. Equally important is the ability to restart on another node at the last state before failure using the saved state from the shared storage." I would consider it likely that both systems run in conjunction, but the primary system is in control and the backup system merely "observes", ready to take over in an instant as soon as it no longer detects a heartbeat. [1] https://en.wikipedia.org/wiki/Failover [2] https://en.wikipedia.org/wiki/High-availability_cluster#Application_design_requirements
Feb 19 2014
prev sibling parent "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Wednesday, 19 February 2014 at 05:53:55 UTC, Tolga Cakiroglu 
wrote:
 I think knowing only that it has failed is not enough. Because 
 the process is a landing, the other CPU needs to know where the 
 process left off. With that heartbeat signal, the only option 
 is that all sensor information is sent to both CPUs 
 continuously, and the sensor values must be enough to determine 
 the next step to take. Then I think it can continue the process 
 flawlessly.
I don't think watching the video answered this, but it hinted that the second CPU was inactive during landing. If something went wrong, the backup CPU would need to be woken up, at which point it would take in all the readings from the different sensors to decide on its actions (possibly it was intended only to land the rover, not to land it in the correct location). What was interesting from the video is that the second CPU was originally going to be turned off for the landing and not used as a backup. A year before landing (I guess that means three months before launch) they decided to create the backup software in case the main CPU failed during landing; it didn't.
Feb 20 2014
prev sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Tuesday, 18 February 2014 at 23:05:21 UTC, Walter Bright wrote:
 http://cacm.acm.org/magazines/2014/2/171689-mars-code/fulltext

 Some interesting tidbits:

 "We later revised it to require that the flight software as a 
 whole, and each module within it, had to reach a minimal 
 assertion density of 2%. There is compelling evidence that 
 higher assertion densities correlate with lower residual defect 
 densities."

 This has been my experience with asserts, too.

 "A failing assertion is now tied in with the fault-protection 
 system and by default places the spacecraft into a predefined 
 safe state where the cause of the failure can be diagnosed 
 carefully before normal operation is resumed."

 Nice to see confirmation of that.

 "Running the same landing software on two CPUs in parallel 
 offers little protection against software defects. Two 
 different versions of the entry-descent-and-landing code were 
 therefore developed, with the version running on the backup CPU 
 a simplified version of the primary version running on the main 
 CPU. In the case where the main CPU would have unexpectedly 
 failed during the landing sequence, the backup CPU was 
 programmed to take control and continue the sequence following 
 the simplified procedure."

 An example of using dual systems for reliability.
Having read Code Complete when I was at university, coupled with Ada and Eiffel experience, allowed me to live better with C by:

- compiling with all warnings enabled as errors
- making judicious use of assert as a poor man's contract system
- running the code regularly through static analyzers

Regarding the last point, I read somewhere that lint was actually supposed to be part of the C toolchain, but when most people ported C to home computers it was largely overlooked, hence the situation we got into with C.

--
Paulo
Feb 19 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/19/2014 12:25 AM, Paulo Pinto wrote:
 Regarding the last point, I read somewhere that lint was actually
 supposed to be part of C toolchain, but when most people tried to
 port C to home computers it was highly overlooked, hence the
 situation we got into with C.
The unix toolchain did not port well to 16 bit machines. The C compilers for the PC were pretty much all built from scratch.
Feb 19 2014
next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 19 February 2014 at 08:49:36 UTC, Walter Bright
wrote:
 On 2/19/2014 12:25 AM, Paulo Pinto wrote:
 Regarding the last point, I read somewhere that lint was 
 actually supposed to be part of C toolchain, but when most 
 people tried to port C to home computers it was highly 
 overlooked, hence the situation we got into with C.
The unix toolchain did not port well to 16 bit machines. The C compilers for the PC were pretty much all built from scratch.
Thanks for the info.
Feb 19 2014
prev sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Wed, 2014-02-19 at 00:49 -0800, Walter Bright wrote:
[…]
 The unix toolchain did not port well to 16 bit machines. The C 
 compilers for the PC were pretty much all built from scratch.
On the other hand, the UNIX tool chain worked fine for me on PDP-11s, which were very definitely 16-bit. Though we were very happy when we got a VAX-11/750 and 32-bits. -- Russel. ============================================================================= Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Feb 19 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/19/2014 2:21 AM, Russel Winder wrote:
 On Wed, 2014-02-19 at 00:49 -0800, Walter Bright wrote:
 […]
 The unix toolchain did not port well to 16 bit machines. The C 
 compilers for the PC were pretty much all built from scratch.
On the other hand, the UNIX tool chain worked fine for me on PDP-11s, which were very definitely 16-bit. Though we were very happy when we got a VAX-11/750 and 32-bits.
PC compilers needed to support multiple pointer types. The 11 did not have segmented addresses, so this was irrelevant for the 11. Trying to retrofit unix compilers with near/far/huge turned out to not be so practical, at least I don't know anyone who tried it.
Feb 19 2014
parent Russel Winder <russel winder.org.uk> writes:
On Wed, 2014-02-19 at 02:30 -0800, Walter Bright wrote:
[…]
 PC compilers needed to support multiple pointer types. The 11 
 did not have segmented addresses, so this was irrelevant for 
 the 11. Trying to retrofit unix compilers with near/far/huge 
 turned out to not be so practical, at least I don't know 
 anyone who tried it.
Indeed. I much preferred the PDP-11/UNIX v6 approach to segmentation to the 8086/CP/M/DOS approach. Though it wasn't fun when there was a segmentation violation, of course; I hate "violation". The only thing worse was "bus error". -- Russel.
Feb 19 2014