digitalmars.D - [gsoc] Mir project

Seb <seb wilzba.ch> writes:
So here's another discussion thread for this year's GSoC. This 
time it's about the Mir project.
The wiki already contains some information:

https://wiki.dlang.org/GSOC_2019_Ideas#Mir_Project

To the community: what do you miss most in the Mir project?

Also, Ilya recently sent a long email to one student with more 
details on the DataFrame project, and I wanted to make it 
available to all students:

---

You can choose almost any project you can finish during GSoC.

Common requirements if you choose me as your mentor:
1. You should take care of deadlines, reports, and GSoC 
formalities yourself.
2. I won't spend a lot of time on GSoC. You will need to work out 
almost everything yourself. I will only formulate the final goal, 
the API and implementation requirements, and intermediate goals. 
If you do cool things, I will help to make them even better.
3. The final result of the project should have a tangible 
positive impact on Mir and D in general. The project should be 
completely ready to be accepted.
4. A GSoC project should be of professional quality. You will 
need to become a professional in the field of your GSoC project; 
this is a mandatory requirement.

For example, if you choose to implement basic matrix operations 
in D, then the two links to start would be:
- Anatomy of High-Performance Matrix Multiplication 
(https://www.cs.utexas.edu/users/flame/pubs/GotoTOMS_final.pdf)
- [Experimental] LLVM-accelerated Generic Linear Algebra 
Subprograms (https://github.com/libmir/mir-glas)

To work on GLAS you would need a good understanding of Goto's 
paper, LLVM IR, SIMD programming with LDC, and the GLAS source 
code.

DataFrame project
=================

The mir-algorithm package 
(https://github.com/libmir/mir-algorithm) has Slice/ndslice (a 
numpy.ndarray analog) and Series (a pandas.Series analog). Series 
should be fused into Slice, so that Slice becomes a generalized 
multidimensional DataFrame analog. Labels (indexes) will be 
optional, and the current Slice API and speed will be preserved. 
However, this would make the development of generic libraries 
hard. To make it simpler, we need to improve the D language and 
the DMD compiler. This can be split into two parts: a language 
change (DIP) and a pull request with the required changes in DMD.
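
To illustrate what "optional labels without changing the unlabeled API or layout" could mean, here is a minimal, self-contained sketch. It is not mir's Slice and not the planned design: the names LabeledVector and byLabel are hypothetical, and the sketch only handles the one-dimensional case.

```d
import std.algorithm.searching : countUntil;

// Hypothetical 1-D container, NOT mir's Slice: the label type is an
// optional template parameter, and label storage exists only if it is given.
struct LabeledVector(T, Label = void)
{
    T[] data;

    // Positional access is always available.
    ref T opIndex(size_t i) { return data[i]; }

    static if (!is(Label == void))
    {
        Label[] labels;

        // Label-based lookup (linear search, for illustration only).
        ref T byLabel(Label key)
        {
            auto i = labels.countUntil(key);
            assert(i >= 0, "unknown label");
            return data[i];
        }
    }
}

void main()
{
    // Unlabeled: same layout as a plain D slice, no label members at all.
    auto plain = LabeledVector!double([1.0, 2.0, 3.0]);
    static assert(typeof(plain).sizeof == (double[]).sizeof);
    static assert(!__traits(compiles, plain.byLabel("a")));
    assert(plain[0] == 1.0);

    // Labeled: loosely comparable to a pandas Series.
    auto s = LabeledVector!(double, string)([1.0, 2.0, 3.0], ["a", "b", "c"]);
    assert(s[1] == 2.0);
    assert(s.byLabel("c") == 3.0);
}
```

The point of the static if is that an unlabeled instantiation pays nothing for the feature, which is the property the paragraph above asks the real Slice to keep.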

The DataFrame GSoC project results will be accepted if you write 
the 'clever alias' DIP AND the DIP is approved by Andrei 
Alexandrescu and Walter Bright before the end of the GSoC AND you 
also do at least one of the following:
1. Implement the DIP for the DMD compiler. (DMD is written in D, 
but I have no idea about its internals) OR
2. Add Labels (Indexes) support to the ndslice package to make 
Slice a generalization of DataFrame

It is quite a risky project. Compared with GLAS and FFT, the 
DataFrame project also requires very good communication skills, a 
lot of patience, and some luck.

Links to start with for DataFrame:

https://issues.dlang.org/show_bug.cgi?id=16486
https://issues.dlang.org/show_bug.cgi?id=16465

In brief, the idea of the DIP is that code like the following 
should work:

alias PackedUpperTriangularMatrix(T) = Slice!(StairsIterator!(T*, 
"-"));

// fails, issue 16486
auto foo(T)(PackedUpperTriangularMatrix!T m)
{
}

// Current workaround: it is too crazy to expect users to
// know what StairsIterator!(T*, "-") is.
auto foo(T)(Slice!(StairsIterator!(T*, "-")) m)
{
}
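
The snippet above depends on mir. The following self-contained sketch reproduces the same situation with stand-in types, so it can be compiled without any dependency. The struct definitions and the function names viaAlias and viaFullType are assumptions made for illustration; only the template shapes mirror the mir types.

```d
// Stand-ins for the mir types; only the template shapes matter here.
struct StairsIterator(Ptr, string direction) { Ptr ptr; }
struct Slice(Iterator) { Iterator iterator; }

alias PackedUpperTriangularMatrix(T) = Slice!(StairsIterator!(T*, "-"));

// Parameter spelled through the alias template (the case from issue 16486).
size_t viaAlias(T)(PackedUpperTriangularMatrix!T m) { return T.sizeof; }

// Parameter spelled with the full type: T can be deduced from the argument.
size_t viaFullType(T)(Slice!(StairsIterator!(T*, "-")) m) { return T.sizeof; }

void main()
{
    PackedUpperTriangularMatrix!double m;

    // Deduction through the fully spelled-out type works.
    assert(viaFullType(m) == double.sizeof);

    // Passing T explicitly also works, because nothing has to be deduced.
    assert(viaAlias!double(m) == double.sizeof);

    // Reports at compile time whether implicit deduction through the alias
    // is accepted by the compiler in use.
    pragma(msg, "viaAlias(m) compiles: ", __traits(compiles, viaAlias(m)));
}
```

Explicit instantiation such as viaAlias!double(m) is a second workaround, but it still forces the caller to name the element type at every call site, which is what the DIP is meant to avoid.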

Currently used Slice types in Lubeck / production code:
- Slice!(double*) - D slice analog
- Slice!(double*, 1, Universal) - BLAS vector, used in mir-lapack 
  and mir-blas
- Slice!(double*, 2) - contiguous matrix that has an efficient 
  loop for iteration over elements, see the 
  mir.algorithm.iteration sources
- Slice!(double*, 2, Canonical) - BLAS/LAPACK matrix 
  representation, used in mir-lapack and mir-blas
- Slice!(double*, N, Universal) - zero-copy view to work with 
  ndarray in numpy, see also the low-level API bindings and the 
  high-level bindings
- Slice!(StairsIterator!(double*, "+")) and 
  Slice!(StairsIterator!(double*, "-")) - packed storage for 
  triangular matrices, for BLAS/LAPACK
- Slice!(ChopIterator!(size_t*, uint*)) - memory-efficient graph 
  representation without labels

Possible future Slice types (2019?):
- Slice!(double*, 1, Contiguous, string*) - like Pandas Series
- Slice!(double*, 2, Contiguous, LabelT1*, LabelT2*) - like Pandas 
  DataFrame
- Slice!(double*, 2, Contiguous, LabelT1*, LabelT2*, LabelT3*) - 
  like Pandas Panel
- Slice!(ChopIterator!(size_t*, uint*), 1, Contiguous, string*) - 
  memory-efficient graph representation with labels
- Slice!(ChopIterator!(size_t*, Slice!(double*, 1, Contiguous, 
  uint*))) - sparse matrix representation that can be used to 
  interact with existing C/C++/Fortran libraries

If you would be able to write a good DIP and create a pull 
request with its implementation it would be awesome. I can pay 
400$ as a bonus if the DIP implementation is merged to DMD.
---
Mar 19 2019
jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 19 March 2019 at 11:16:37 UTC, Seb wrote:
 [snip]

 DataFrame project
 =================

 [snip]

 If you would be able to write a good DIP and create a pull 
 request with its implementation it would be awesome. I can pay 
 400$ as a bonus if the DIP implementation is merged to DMD.
 ---
Oh, I like this a lot. I'd throw in an extra $100 if it gets merged. Just getting the DIP written and an implementation would be very good. Getting the DIP approved might take longer than 3 months though.
Mar 19 2019
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 19 March 2019 at 11:34:51 UTC, jmh530 wrote:
 Just getting the DIP written and an implementation would be 
 very good. Getting the DIP approved might take longer than 3 
 months though.
I'm not sure why a DIP is needed and I'm not sure why those issues are closed resolved fixed, since they don't work. Anyway, with the DConf AGM I'm hoping to remove a lot of the backlog and streamline the process so that it shouldn't take forever. Who knows, it might even get in-principle pre-approval. I think implementation is going to be the hard part; then again, it does sound easy in theory to fix...
Mar 19 2019
jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 19 March 2019 at 11:51:22 UTC, Nicholas Wilson wrote:
 [snip]

 I'm not sure why a DIP is needed and I'm not sure why those 
 issues are closed resolved fixed, since they don't work,

 [snip]
One is Resolved Invalid, the other is Resolved Duplicate. Following the rabbit hole on the duplicate leads to
https://issues.dlang.org/show_bug.cgi?id=1807
The Resolved Invalid one also references
https://issues.dlang.org/show_bug.cgi?id=6082
Mar 19 2019
Iti Shree <itishree1999 gmail.com> writes:
Will it be alright if I upload proposal for mir project (Cpuid) 
but also work on  DataFrame project (outside gsoc)? Because the 
work sounds interesting but since you said it's risky I find it 
better to work on mir-cpuid first and then if I have enough time 
work on DataFrame project.
Mar 29 2019
Seb <seb wilzba.ch> writes:
On Friday, 29 March 2019 at 11:21:53 UTC, Iti Shree wrote:
 Will it be alright if I upload proposal for mir project (Cpuid) 
 but also work on  DataFrame project (outside gsoc)? Because the 
 work sounds interesting but since you said it's risky I find it 
 better to work on mir-cpuid first and then if I have enough 
 time work on DataFrame project.
Of course, that's okay. You can also submit multiple proposals via the GSoC platform.
Mar 29 2019
Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Friday, 29 March 2019 at 11:21:53 UTC, Iti Shree wrote:
 Will it be alright if I upload proposal for mir project (Cpuid) 
 but also work on  DataFrame project (outside gsoc)?
Yes, but remember you only have so much time ;)
Mar 29 2019