
digitalmars.D.learn - dataframe implementations

reply Jay Norwood <jayn prismnet.com> writes:
I was reading about the Julia dataframe implementation yesterday, 
trying to understand their decisions and how D might implement one.

 From my notes,
1. they are currently using a dictionary of column vectors.
2. for NA (not available) they are currently using an array of 
bytes, effectively as a Boolean flag, rather than a bitVector, 
for performance reasons.
3. they are not currently implementing hierarchical headers.
4. they are transforming non-valid symbol header strings (read 
from csv, for example) to valid symbols by replacing '.' with 
underscore and prefixing numbers with 'x', as examples.  This 
allows use in expressions.
5. Along with 4., they currently have a "with" construct for 
DataVector, to allow expressions to use, for example, 
:symbol_name instead of dv[:symbol_name].
6. They have operator symbols for per-element operations on two 
vectors; for example, a ./ b applies the operation element-wise.
7. They currently only have row indexes, no row names or symbols.
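To make points 1 and 2 concrete, here is a rough sketch of that layout (Python used for illustration; the class and names are invented, not Julia's actual code):

```python
import array

# Rough sketch of a frame as a dict of column vectors (point 1),
# with a parallel byte array per column marking NA (point 2),
# used instead of a packed bit vector.
class Frame:
    def __init__(self):
        self.columns = {}   # column name -> list of values
        self.na_mask = {}   # column name -> byte array: 1 = missing

    def add_column(self, name, values, missing=()):
        self.columns[name] = list(values)
        mask = array.array('b', [0] * len(values))
        for i in missing:            # mark the given row indexes as NA
            mask[i] = 1
        self.na_mask[name] = mask

    def dropna(self, name):
        """One column's values with the NA entries dropped."""
        return [v for v, m in zip(self.columns[name], self.na_mask[name])
                if not m]

f = Frame()
f.add_column('price', [10.0, 11.5, 0.0, 12.0], missing=[2])
print(f.dropna('price'))   # [10.0, 11.5, 12.0]
```

Dropping versus filling missing values would then just be two different walks over the same mask.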

I saw someone posting that they were working on a DataFrame 
implementation here, but I haven't been able to locate any code 
on GitHub, and was wondering what implementation decisions are 
being made here.  Thanks.
Nov 02 2015
next sibling parent reply Laeeth Isharc <laeethnospam nospamlaeeth.com> writes:
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
 I was reading about the Julia dataframe implementation 
 yesterday, trying to understand their decisions and how D might 
 implement.

 From my notes,
 1. they are currently using a dictionary of column vectors.
 2. for NA (not available) they are currently using an array of 
 bytes, effectively as a Boolean flag, rather than a bitVector, 
 for performance reasons.
 3. they are not currently implementing hierarchical headers.
 4. they are transforming non-valid symbol header strings (read 
 from csv, for example) to valid symbols by replacing '.' with 
 underscore and prefixing numbers with 'x', as examples.  This 
 allows use in expressions.
 5. Along with 4., they currently have a "with" construct for 
 DataVector, to allow expressions to use, for example, 
 :symbol_name instead of dv[:symbol_name].
 6. They have operator symbols for per-element operations on two 
 vectors; for example, a ./ b applies the operation element-wise.
 7. They currently only have row indexes, no row names or 
 symbols.

 I saw someone posting that they were working on DataFrame 
 implementation here, but haven't been able to locate any code 
 in github, and was wondering what implementation decisions are 
 being made here.  Thanks.
Hi Jay. That may have been me. I have implemented something very basic, but you can read and write my proto dataframe to/from CSV and HDF5. The code is up here:

https://github.com/Laeeth/d_dataframes

You should think of it as a crude prototype that has nonetheless been useful for me, but it's done more in the old-school hacker spirit of getting something working first rather than being designed properly. The reason for that is that I have a lot on my plate at the moment, and technology is only one of many of these, although an important one.

In time I may get someone else to work on dataframes and open-source the results, but that may be some months away. So I'd welcome any assistance, or even someone taking it over. I haven't really done a good job of providing idiomatic access, but it's something and a start.

Laeeth.
Nov 02 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Monday, 2 November 2015 at 15:33:34 UTC, Laeeth Isharc wrote:
 Hi Jay.

 That may have been me.  I have implemented something very 
 basic, but you can read and write my proto dataframe to/from 
 CSV and HDF5.  The code is up here:

 https://github.com/Laeeth/d_dataframes
Yes, thanks. I believe I did see your comments previously. It's great that you've already got support for HDF5. I'll take a look.
Nov 02 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
I looked through the dataframe code, and I have a couple of comments...

I had thought perhaps an app could read in the header and type 
info from HDF5, and generate D struct definitions with column 
headers as symbol names.  That would enable faster processing 
than with the associative arrays, as well as support the 
auto-completion that would be helpful in writing expressions.

The CSV type info for columns could be inferred, or else stated 
in the reader call, as is done optionally in Julia.
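A minimal version of that inference step might look like this (illustrative Python, not Julia's actual reader; a real reader would also handle dates, NA tokens, quoting, and so on):

```python
def infer_type(cells):
    """Infer a column type from its string cells: try int, then
    float, else fall back to str.  (Sketch only.)"""
    for target in (int, float):
        try:
            for s in cells:
                target(s)
            return target
        except ValueError:
            pass
    return str

# Header row plus data rows, as a CSV reader would yield them.
rows = [["id", "price", "note"],
        ["1", "10.5", "ok"],
        ["2", "11.0", "n/a"]]
header, data = rows[0], rows[1:]
types = [infer_type(col) for col in zip(*data)]
print([t.__name__ for t in types])   # ['int', 'float', 'str']
```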

In both cases the column names would have to be valid symbol 
names for this to work.  I believe Julia also expects this, or 
else does some conversion on your column names to make them valid 
symbols. I think the D csv processing would also need to check if 
the

The jupyter interactive environment supports python pandas and 
Julia dataframe column names in the autocompletion, and so I 
think the D debugging environment would need to provide similar 
capability if it is to be considered as a fast-recompile 
substitute for interactive dataframe exploration.
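The kind of column-name sanitization described above (replace '.', prefix leading digits with 'x') could be sketched like this (illustrative Python; not the actual Julia algorithm):

```python
import re

def sanitize(name):
    """Turn an arbitrary CSV header into a valid identifier:
    '.' becomes '_', other invalid characters become '_', and a
    leading digit gets an 'x' prefix.  (Sketch only.)"""
    name = name.replace('.', '_')
    name = re.sub(r'\W', '_', name)   # \W = anything not [A-Za-z0-9_]
    if name and name[0].isdigit():
        name = 'x' + name
    return name

print(sanitize('adj.close'))   # adj_close
print(sanitize('2015 high'))   # x2015_high
```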

It seems to me that your particular examples of stock data would 
eventually need to handle missing data, as supported in Julia 
dataframes and python pandas.  They both provide ways to drop or 
fill missing values.  Did you want to support that?
Nov 17 2015
parent Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Tuesday, 17 November 2015 at 13:56:14 UTC, Jay Norwood wrote:
 I looked through the dataframe code and a couple of comments...

 I had thought perhaps an app could read in the header info and 
 type info from hdf5, and generate D struct definitions with 
 column headers as symbol names.  That would enable faster 
 processing than with the associative arrays, as well as support 
 the auto-completion that would be helpful in writing 
 expressions.
Yes - I think that one will want to have a choice between this kind of approach and using associative arrays, because for some purposes it's not convenient to have to compile code every time you open a strange file, and on the other hand the hit with an AA will sometimes matter.

The situation at the moment for me is that I have very little time to work on a correct general solution for this problem myself (yet it's important for D that we do get to one). I also lack the experience with D to do it very well very quickly. I do have a couple of seasoned people from the community helping me with things, but dataframes won't be the first thing they look at, and it could be a while before we get to that.

If we implement it for our own needs, then I will open-source it, as that is commercially sensible as well as the right thing to do. But that could be a year away. Vlad Levenfeld was also looking at this a bit.
 The csv type info for columns could be inferred, or else stated 
 in the reader call, as done as an option in julia.

 In both cases the column names would have to be valid symbol 
 names for this to work.  I believe Julia also expects this, or 
 else does some conversion on your column names to make them 
 valid symbols. I think the D csv processing would also need to 
 check if the

 The jupyter interactive environment supports python pandas and 
 Julia dataframe column names in the autocompletion, and so I 
 think the D debugging environment would need to provide similar 
 capability if it is to be considered as a fast-recompile 
 substitute for interactive dataframe exploration.
Well we don't need to get there in a single bound - already just being able to do this at all is a big improvement, and I am already using D with jupyter to do things.
 It seems to me that your particular examples of stock data 
 would eventually need to handle missing data, as supported in 
 Julia dataframes and python pandas.  They both provide ways to 
 drop or fill missing values.  Did you want to support that?
Yes - we should do so eventually, and there's much more that could be done. But maybe a sensible basic implementation is a start, and we can refine after that.

I wrote the dataframe in a couple of evenings, so I am sure it can be improved, and even rearchitected. Pull requests are welcomed, and maybe we should set up a Trello to organise ideas? Let me know if you are in.
Nov 18 2015
prev sibling next sibling parent reply Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
 I was reading about the Julia dataframe implementation 
 yesterday, trying to understand their decisions and how D might 
 implement.

 From my notes,
 1. they are currently using a dictionary of column vectors.
 2. for NA (not available) they are currently using an array of 
 bytes, effectively as a Boolean flag, rather than a bitVector, 
 for performance reasons.
 3. they are not currently implementing hierarchical headers.
 4. they are transforming non-valid symbol header strings (read 
 from csv, for example) to valid symbols by replacing '.' with 
 underscore and prefixing numbers with 'x', as examples.  This 
 allows use in expressions.
 5. Along with 4., they currently have a "with" construct for 
 DataVector, to allow expressions to use, for example, 
 :symbol_name instead of dv[:symbol_name].
 6. They have operator symbols for per-element operations on two 
 vectors; for example, a ./ b applies the operation element-wise.
 7. They currently only have row indexes, no row names or 
 symbols.

 I saw someone posting that they were working on DataFrame 
 implementation here, but haven't been able to locate any code 
 in github, and was wondering what implementation decisions are 
 being made here.  Thanks.
What do you think about the use of NaN for missing floats? In theory I could imagine wanting to distinguish between an NaN in the source file and a missing value, but in my world I never felt the need for this. For integers and bools, that is different of course.
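For reference, the main pitfalls of a plain-NaN sentinel can be shown in a few lines (illustrative Python):

```python
import math
import struct

nan = float('nan')

# 1. NaN never compares equal, so naive equality tests miss it.
print(nan == nan)        # False
print(math.isnan(nan))   # True: isnan() is required instead

# 2. A NaN read from the source file is indistinguishable from a
#    NaN inserted to mean "missing", unless you keep a separate
#    mask or reserve one specific NaN bit pattern (as R does).
payload = struct.unpack('<Q', struct.pack('<d', nan))[0]
print(hex(payload))      # one of the many possible NaN bit patterns
# Every NaN has an all-ones exponent field; the rest is payload.
assert (payload >> 52) & 0x7FF == 0x7FF
```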
Nov 18 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Wednesday, 18 November 2015 at 17:15:38 UTC, Laeeth Isharc 
wrote:
 What do you think about the use of NaN for missing floats?  In 
 theory I could imagine wanting to distinguish between an NaN in 
 the source file and a missing value, but in my world I never 
 felt the need for this.  For integers and bools, that is 
 different of course.
The Julia discussions mention another dataframe implementation, I believe it was for R, where NaN was used. There was some mention of the virtues of their own choice and of the problems with NaN; I think the NA representation used a particular encoding of NaN. Other implementations they mentioned used some reserved value in each of the numeric data types to represent NA.

In the Julia case, I believe what they use is a separate byte vector for each column that holds the NA status. They discussed some other possible enhancements, but I don't know what they implemented. For example, if the single byte holds the NA flag, the cell value can hold additional info, maybe the reason for the NA. There was also some discussion of having the associated cell hold repeat counts for the NA status, which I suppose was meant to repeat it for following cells in the column vector.

I'll try to find the discussions and post the link.
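The flag-plus-reason idea could look something like this (illustrative Python; the reason codes are made up):

```python
# The NA byte per cell can carry more than a flag: 0 = present,
# nonzero = a code for *why* the value is missing.
NOT_MISSING, NO_DATA, PARSE_ERROR = 0, 1, 2

values = [10.0, 0.0, 12.5, 0.0]
mask   = [NOT_MISSING, PARSE_ERROR, NOT_MISSING, NO_DATA]

present = [v for v, m in zip(values, mask) if m == NOT_MISSING]
reasons = [m for m in mask if m != NOT_MISSING]
print(present)   # [10.0, 12.5]
print(reasons)   # [2, 1]
```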
Nov 18 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Wednesday, 18 November 2015 at 18:04:30 UTC, Jay Norwood wrote:
 vector.  I'll try to find the discussions and post the link.
Here are the two discussions I recall on the julia NA implementation. http://wizardmac.tumblr.com/post/104019606584/whats-wrong-with-statistics-in-julia-a-reply https://github.com/JuliaLang/julia/pull/9363
Nov 18 2015
parent Jay Norwood <jayn prismnet.com> writes:
One more discussion link on the NA subject. This one is on the R 
implementation of NA using a single encoding of NaN, as well as 
its treatment of a selected integer value as an NA.

http://rsnippets.blogspot.com/2013/12/gnu-r-vs-julia-is-it-only-matter-of.html
Nov 18 2015
prev sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
 I saw someone posting that they were working on DataFrame 
 implementation here, but haven't been able to locate any code 
 in github, and was wondering what implementation decisions are 
 being made here.  Thanks.
My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices.
Nov 18 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. Dataframes often consist of different data types by column. How did you see the nd slices being used?

Maybe the nd slices could be applied if you considered each row to be the same structure, and sliced by rows rather than operating on columns. Pandas supports a multi-dimension panel; maybe this would be the application for nd slices by row.
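The rows-as-one-structure view can be sketched like so (illustrative Python, with a namedtuple standing in for what would be a struct in D):

```python
from collections import namedtuple

# Each row is one fixed structure; slicing happens by rows.
Row = namedtuple('Row', ['date', 'open', 'close'])

rows = [Row('2015-11-02', 10.0, 10.5),
        Row('2015-11-03', 10.5, 10.2),
        Row('2015-11-04', 10.2, 10.8)]

print(rows[1:])                  # a row slice: uniform element type
print([r.close for r in rows])   # column access via the field names
```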
Nov 18 2015
next sibling parent John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
 On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.
You might not build on the nd slice type itself, but implementing the same API (where possible/appropriate) would be good.
Nov 19 2015
prev sibling next sibling parent reply ZombineDev <valid_email he.re> writes:
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
 On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.
How about using an nd slice of Variants, or of a more specialized Algebraic [1] type?

[1]: http://dlang.org/phobos/std_variant
Nov 19 2015
parent reply Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Thursday, 19 November 2015 at 22:14:01 UTC, ZombineDev wrote:
 On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood 
 wrote:
 On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.
 How about using an nd slice of Variants, or of a more 
 specialized Algebraic [1] type?
 [1]: http://dlang.org/phobos/std_variant
Not sure it is a great idea to use a variant as the basic option when very often you will know that every cell in a particular column will be of the same type.
Nov 21 2015
parent Jay Norwood <jayn prismnet.com> writes:
On Saturday, 21 November 2015 at 14:16:26 UTC, Laeeth Isharc 
wrote:
 Not sure it is a great idea to use a variant as the basic 
 option when very often you will know that every cell in a 
 particular column will be of the same type.
I'm reading today about an n-dim extension to pandas named xray. Maybe I should try to understand how that fits. They support I/O from netCDF, and are making extensions to support blocked input using dask, so they can process data larger than in-memory limits.

http://xray.readthedocs.org/en/stable/data-structures.html
https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python

In general, pandas and xray are built around the requirement of pulling in data from storage with initially unknown column and index names and data types. Julia throws in support for JIT compilation and specialized operations for different data types.

It seems to me that D's strength would be a quick compile, which would then allow you to replace the dictionary tag implementations and variants with something that used compile-time symbol names and data types. It seems like that would provide more efficient processing, as well as better tab-completion support when creating expressions.
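The code-generation idea could be sketched roughly like this (illustrative Python emitting D-like struct source from column metadata; the type-mapping table and names are invented):

```python
# Given column names and types recovered from the file's metadata,
# emit a typed row-struct definition that a D build could compile.
D_TYPES = {'int64': 'long', 'float64': 'double', 'string': 'string'}

def gen_struct(name, columns):
    lines = ['struct %s {' % name]
    for col, ty in columns:
        lines.append('    %s %s;' % (D_TYPES[ty], col))
    lines.append('}')
    return '\n'.join(lines)

print(gen_struct('Row', [('date', 'string'),
                         ('open', 'float64'),
                         ('volume', 'int64')]))
```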
Dec 03 2015
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
 Maybe the nd slices could be applied if you considered each row 
 to be the same structure, and slice by rows rather than 
 operating on columns.  Pandas supports a multi-dimension panel.
  Maybe this would be the application for nd slices by row.
I meant in the sense that Pandas is built upon Numpy.
Nov 20 2015