
digitalmars.D.learn - dataframe implementations

reply Jay Norwood <jayn prismnet.com> writes:
I was reading about the Julia dataframe implementation yesterday, 
trying to understand their decisions and how D might implement one.

 From my notes,
1. they are currently using a dictionary of column vectors.
2. for NA (not available) they are currently using an array of 
bytes, effectively as a Boolean flag, rather than a bitVector, 
for performance reasons.
3. they are not currently implementing hierarchical headers.
4. they are transforming non-valid symbol header strings (read 
from csv, for example) to valid symbols by replacing '.' with 
underscore and prefixing numbers with 'x', as examples.  This 
allows use in expressions.
5. Along with 4., they currently have a "with" construct for 
DataVector, to allow expressions to use, for example, 
:symbol_name instead of dv[:symbol_name].
6. They have operator symbols for per-element operations on two 
vectors; for example, a ./ b applies the operation element-wise.
7. They currently only have row indexes, no row names or symbols.
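To make points 1 and 2 concrete, here is a rough sketch of that layout (Python used for illustration; the class and names are invented, not Julia's actual code):

```python
import array

# Rough sketch of a frame as a dict of column vectors (point 1),
# with a parallel byte array per column marking NA (point 2),
# used instead of a packed bit vector.
class Frame:
    def __init__(self):
        self.columns = {}   # column name -> list of values
        self.na_mask = {}   # column name -> byte array: 1 = missing

    def add_column(self, name, values, missing=()):
        self.columns[name] = list(values)
        mask = array.array('b', [0] * len(values))
        for i in missing:            # mark the given row indexes as NA
            mask[i] = 1
        self.na_mask[name] = mask

    def dropna(self, name):
        """One column's values with the NA entries dropped."""
        return [v for v, m in zip(self.columns[name], self.na_mask[name])
                if not m]

f = Frame()
f.add_column('price', [10.0, 11.5, 0.0, 12.0], missing=[2])
print(f.dropna('price'))   # [10.0, 11.5, 12.0]
```

Dropping versus filling missing values would then just be two different walks over the same mask.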

I saw someone posting that they were working on a DataFrame 
implementation here, but I haven't been able to locate any code 
on GitHub, and was wondering what implementation decisions are 
being made here.  Thanks.
Nov 02 2015
next sibling parent reply Laeeth Isharc <laeethnospam nospamlaeeth.com> writes:
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
 I was reading about the Julia dataframe implementation 
 yesterday, trying to understand their decisions and how D might 
 implement.

 From my notes,
 1. they are currently using a dictionary of column vectors.
 2. for NA (not available) they are currently using an array of 
 bytes, effectively as a Boolean flag, rather than a bitVector, 
 for performance reasons.
 3. they are not currently implementing hierarchical headers.
 4. they are transforming non-valid symbol header strings (read 
 from csv, for example) to valid symbols by replacing '.' with 
 underscore and prefixing numbers with 'x', as examples.  This 
 allows use in expressions.
 5. Along with 4., they currently have a "with" construct for 
 DataVector, to allow expressions to use, for example, 
 :symbol_name instead of dv[:symbol_name].
 6. They have operator symbols for per-element operations on two 
 vectors; for example, a ./ b applies the operation element-wise.
 7. They currently only have row indexes, no row names or 
 symbols.

 I saw someone posting that they were working on DataFrame 
 implementation here, but haven't been able to locate any code 
 in github, and was wondering what implementation decisions are 
 being made here.  Thanks.
Hi Jay. That may have been me. I have implemented something very basic, but you can read and write my proto dataframe to/from CSV and HDF5. The code is up here:

https://github.com/Laeeth/d_dataframes

You should think of it as a crude prototype that has nonetheless been useful for me, but it's done more in the old-school hacker spirit of getting something working first rather than being designed properly. The reason for that is that I have a lot on my plate at the moment, and technology is only one of many of these, although an important one.

In time I may get someone else to work on dataframes and open-source the results, but that may be some months away. So I'd welcome any assistance, or even someone taking it over. I haven't really done a good job of providing idiomatic access, but it's something and a start.

Laeeth.
Nov 02 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Monday, 2 November 2015 at 15:33:34 UTC, Laeeth Isharc wrote:
 Hi Jay.

 That may have been me.  I have implemented something very 
 basic, but you can read and write my proto dataframe to/from 
 CSV and HDF5.  The code is up here:

 https://github.com/Laeeth/d_dataframes
Yes, thanks. I believe I did see your comments previously. It's great that you've already got support for HDF5. I'll take a look.
Nov 02 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
I looked through the dataframe code, and I have a couple of comments...

I had thought perhaps an app could read in the header and type 
info from HDF5, and generate D struct definitions with column 
headers as symbol names.  That would enable faster processing 
than with the associative arrays, as well as support the 
auto-completion that would be helpful in writing expressions.

The CSV type info for columns could be inferred, or else stated 
in the reader call, as is done optionally in Julia.
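A minimal version of that inference step might look like this (illustrative Python, not Julia's actual reader; a real reader would also handle dates, NA tokens, quoting, and so on):

```python
def infer_type(cells):
    """Infer a column type from its string cells: try int, then
    float, else fall back to str.  (Sketch only.)"""
    for target in (int, float):
        try:
            for s in cells:
                target(s)
            return target
        except ValueError:
            pass
    return str

# Header row plus data rows, as a CSV reader would yield them.
rows = [["id", "price", "note"],
        ["1", "10.5", "ok"],
        ["2", "11.0", "n/a"]]
header, data = rows[0], rows[1:]
types = [infer_type(col) for col in zip(*data)]
print([t.__name__ for t in types])   # ['int', 'float', 'str']
```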

In both cases the column names would have to be valid symbol 
names for this to work.  I believe Julia also expects this, or 
else does some conversion on your column names to make them valid 
symbols. I think the D csv processing would also need to check if 
the

The jupyter interactive environment supports python pandas and 
Julia dataframe column names in the autocompletion, and so I 
think the D debugging environment would need to provide similar 
capability if it is to be considered as a fast-recompile 
substitute for interactive dataframe exploration.
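The kind of column-name sanitization described above (replace '.', prefix leading digits with 'x') could be sketched like this (illustrative Python; not the actual Julia algorithm):

```python
import re

def sanitize(name):
    """Turn an arbitrary CSV header into a valid identifier:
    '.' becomes '_', other invalid characters become '_', and a
    leading digit gets an 'x' prefix.  (Sketch only.)"""
    name = name.replace('.', '_')
    name = re.sub(r'\W', '_', name)   # \W = anything not [A-Za-z0-9_]
    if name and name[0].isdigit():
        name = 'x' + name
    return name

print(sanitize('adj.close'))   # adj_close
print(sanitize('2015 high'))   # x2015_high
```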

It seems to me that your particular examples of stock data would 
eventually need to handle missing data, as supported in Julia 
dataframes and python pandas.  They both provide ways to drop or 
fill missing values.  Did you want to support that?
Nov 17 2015
parent Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Tuesday, 17 November 2015 at 13:56:14 UTC, Jay Norwood wrote:
 I looked through the dataframe code and a couple of comments...

 I had thought perhaps an app could read in the header info and 
 type info from hdf5, and generate D struct definitions with 
 column headers as symbol names.  That would enable faster 
 processing than with the associative arrays, as well as support 
 the auto-completion that would be helpful in writing 
 expressions.
Yes - I think that one will want to have a choice between this kind of approach and using associative arrays, because for some purposes it's not convenient to have to compile code every time you open a strange file, and on the other hand the hit with an AA will sometimes matter.

The situation at the moment for me is that I have very little time to work on a correct general solution for this problem myself (yet it's important for D that we do get to one). I also lack the experience with D to do it very well very quickly. I do have a couple of seasoned people from the community helping me with things, but dataframes won't be the first thing they look at, and it could be a while before we get to that.

If we implement it for our own needs, then I will open-source it, as that is commercially sensible as well as the right thing to do. But that could be a year away. Vlad Levenfeld was also looking at this a bit.
 The csv type info for columns could be inferred, or else stated 
 in the reader call, as done as an option in julia.

 In both cases the column names would have to be valid symbol 
 names for this to work.  I believe Julia also expects this, or 
 else does some conversion on your column names to make them 
 valid symbols. I think the D csv processing would also need to 
 check if the

 The jupyter interactive environment supports python pandas and 
 Julia dataframe column names in the autocompletion, and so I 
 think the D debugging environment would need to provide similar 
 capability if it is to be considered as a fast-recompile 
 substitute for interactive dataframe exploration.
Well we don't need to get there in a single bound - already just being able to do this at all is a big improvement, and I am already using D with jupyter to do things.
 It seems to me that your particular examples of stock data 
 would eventually need to handle missing data, as supported in 
 Julia dataframes and python pandas.  They both provide ways to 
 drop or fill missing values.  Did you want to support that?
Yes - we should do so eventually, and there's much more that could be done. But maybe a sensible basic implementation is a start, and we can refine after that.

I wrote the dataframe in a couple of evenings, so I am sure it can be improved, and even rearchitected. Pull requests are welcomed, and maybe we should set up a Trello to organise ideas? Let me know if you are in.
Nov 18 2015
prev sibling next sibling parent reply Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
 I was reading about the Julia dataframe implementation 
 yesterday, trying to understand their decisions and how D might 
 implement.

 From my notes,
 1. they are currently using a dictionary of column vectors.
 2. for NA (not available) they are currently using an array of 
 bytes, effectively as a Boolean flag, rather than a bitVector, 
 for performance reasons.
 3. they are not currently implementing hierarchical headers.
 4. they are transforming non-valid symbol header strings (read 
 from csv, for example) to valid symbols by replacing '.' with 
 underscore and prefixing numbers with 'x', as examples.  This 
 allows use in expressions.
 5. Along with 4., they currently have a "with" construct for 
 DataVector, to allow expressions to use, for example, 
 :symbol_name instead of dv[:symbol_name].
 6. They have operator symbols for per-element operations on two 
 vectors; for example, a ./ b applies the operation element-wise.
 7. They currently only have row indexes, no row names or 
 symbols.

 I saw someone posting that they were working on DataFrame 
 implementation here, but haven't been able to locate any code 
 in github, and was wondering what implementation decisions are 
 being made here.  Thanks.
What do you think about the use of NaN for missing floats? In theory I could imagine wanting to distinguish between an NaN in the source file and a missing value, but in my world I never felt the need for this. For integers and bools, that is different of course.
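For reference, the main pitfalls of a plain-NaN sentinel can be shown in a few lines (illustrative Python):

```python
import math
import struct

nan = float('nan')

# 1. NaN never compares equal, so naive equality tests miss it.
print(nan == nan)        # False
print(math.isnan(nan))   # True: isnan() is required instead

# 2. A NaN read from the source file is indistinguishable from a
#    NaN inserted to mean "missing", unless you keep a separate
#    mask or reserve one specific NaN bit pattern (as R does).
payload = struct.unpack('<Q', struct.pack('<d', nan))[0]
print(hex(payload))      # one of the many possible NaN bit patterns
# Every NaN has an all-ones exponent field; the rest is payload.
assert (payload >> 52) & 0x7FF == 0x7FF
```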
Nov 18 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Wednesday, 18 November 2015 at 17:15:38 UTC, Laeeth Isharc 
wrote:
 What do you think about the use of NaN for missing floats?  In 
 theory I could imagine wanting to distinguish between an NaN in 
 the source file and a missing value, but in my world I never 
 felt the need for this.  For integers and bools, that is 
 different of course.
The Julia discussions mention another dataframe implementation, I believe it was for R, where NaN was used. There was some mention of the virtues of their own choice and of the problems with NaN; I think the NA representation used a particular encoding of NaN. Other implementations they mentioned used some reserved value in each of the numeric data types to represent NA.

In the Julia case, I believe what they use is a separate byte vector for each column that holds the NA status. They discussed some other possible enhancements, but I don't know what they implemented. For example, if the single byte holds the NA flag, the cell value can hold additional info, maybe the reason for the NA. There was also some discussion of having the associated cell hold repeat counts for the NA status, which I suppose was meant to repeat it for following cells in the column vector.

I'll try to find the discussions and post the link.
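The flag-plus-reason idea could look something like this (illustrative Python; the reason codes are made up):

```python
# The NA byte per cell can carry more than a flag: 0 = present,
# nonzero = a code for *why* the value is missing.
NOT_MISSING, NO_DATA, PARSE_ERROR = 0, 1, 2

values = [10.0, 0.0, 12.5, 0.0]
mask   = [NOT_MISSING, PARSE_ERROR, NOT_MISSING, NO_DATA]

present = [v for v, m in zip(values, mask) if m == NOT_MISSING]
reasons = [m for m in mask if m != NOT_MISSING]
print(present)   # [10.0, 12.5]
print(reasons)   # [2, 1]
```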
Nov 18 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Wednesday, 18 November 2015 at 18:04:30 UTC, Jay Norwood wrote:
 vector.  I'll try to find the discussions and post the link.
Here are the two discussions I recall on the julia NA implementation. http://wizardmac.tumblr.com/post/104019606584/whats-wrong-with-statistics-in-julia-a-reply https://github.com/JuliaLang/julia/pull/9363
Nov 18 2015
parent Jay Norwood <jayn prismnet.com> writes:
One more discussion link on the NA subject. This one is on the R 
implementation of NA using a single encoding of NaN, as well as 
its treatment of a selected integer value as an NA.

http://rsnippets.blogspot.com/2013/12/gnu-r-vs-julia-is-it-only-matter-of.html
Nov 18 2015
prev sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
 I saw someone posting that they were working on DataFrame 
 implementation here, but haven't been able to locate any code 
 in github, and was wondering what implementation decisions are 
 being made here.  Thanks.
My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices.
Nov 18 2015
parent reply Jay Norwood <jayn prismnet.com> writes:
On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. Dataframes often consist of different data types by column. How did you see the nd slices being used?

Maybe the nd slices could be applied if you considered each row to be the same structure, and sliced by rows rather than operating on columns. Pandas supports a multi-dimension panel; maybe this would be the application for nd slices by row.
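The rows-as-one-structure view can be sketched like so (illustrative Python, with a namedtuple standing in for what would be a struct in D):

```python
from collections import namedtuple

# Each row is one fixed structure; slicing happens by rows.
Row = namedtuple('Row', ['date', 'open', 'close'])

rows = [Row('2015-11-02', 10.0, 10.5),
        Row('2015-11-03', 10.5, 10.2),
        Row('2015-11-04', 10.2, 10.8)]

print(rows[1:])                  # a row slice: uniform element type
print([r.close for r in rows])   # column access via the field names
```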
Nov 18 2015
next sibling parent John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
 On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.
You might not build on the nd slice type itself, but implementing the same API (where possible/appropriate) would be good.
Nov 19 2015
prev sibling next sibling parent reply ZombineDev <valid_email he.re> writes:
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
 On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.
How about using an nd slice of Variants, or of a more specialized Algebraic [1] type?

[1]: http://dlang.org/phobos/std_variant
Nov 19 2015
parent reply Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Thursday, 19 November 2015 at 22:14:01 UTC, ZombineDev wrote:
 On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood 
 wrote:
 On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
 My sense is that any data frame implementation should try to 
 build on the work that's being done with n-dimensional slices.
I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.
 How about using an nd slice of Variants, or of a more 
 specialized Algebraic [1] type?
 [1]: http://dlang.org/phobos/std_variant
Not sure it is a great idea to use a variant as the basic option when very often you will know that every cell in a particular column will be of the same type.
Nov 21 2015
parent Jay Norwood <jayn prismnet.com> writes:
On Saturday, 21 November 2015 at 14:16:26 UTC, Laeeth Isharc 
wrote:
 Not sure it is a great idea to use a variant as the basic 
 option when very often you will know that every cell in a 
 particular column will be of the same type.
I'm reading today about an n-dim extension to pandas named xray. Maybe I should try to understand how that fits. They support I/O from netCDF, and are making extensions to support blocked input using dask, so they can process data larger than in-memory limits.

http://xray.readthedocs.org/en/stable/data-structures.html
https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python

In general, pandas and xray are built around the requirement of pulling in data from storage with initially unknown column and index names and data types. Julia throws in support for JIT compilation and specialized operations for different data types.

It seems to me that D's strength would be a quick compile, which would then allow you to replace the dictionary tag implementations and variants with something that used compile-time symbol names and data types. It seems like that would provide more efficient processing, as well as better tab-completion support when creating expressions.
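The code-generation idea could be sketched roughly like this (illustrative Python emitting D-like struct source from column metadata; the type-mapping table and names are invented):

```python
# Given column names and types recovered from the file's metadata,
# emit a typed row-struct definition that a D build could compile.
D_TYPES = {'int64': 'long', 'float64': 'double', 'string': 'string'}

def gen_struct(name, columns):
    lines = ['struct %s {' % name]
    for col, ty in columns:
        lines.append('    %s %s;' % (D_TYPES[ty], col))
    lines.append('}')
    return '\n'.join(lines)

print(gen_struct('Row', [('date', 'string'),
                         ('open', 'float64'),
                         ('volume', 'int64')]))
```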
Dec 03 2015
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
 Maybe the nd slices could be applied if you considered each row 
 to be the same structure, and slice by rows rather than 
 operating on columns.  Pandas supports a multi-dimension panel.
  Maybe this would be the application for nd slices by row.
I meant in the sense that Pandas is built upon Numpy.
Nov 20 2015