
Factor Support for flowWorkspace - Extended Class Approach #406

Open
mikejiang wants to merge 2 commits into devel from feature/factor-support


@mikejiang
Member

Overview

This implementation adds factor-level preservation to flowWorkspace through new extended classes, keeping changes isolated and non-intrusive to existing code.

Problem

The C++ backend of cytoset stores pData as strings (map<string, string>), which causes factor columns to lose their level information when assigned:

pd <- pData(cs)  # assumes a cytoset `cs` with two samples
pd$Patient <- factor(c("A", "B"), levels = c("C", "B", "A"))
pData(cs) <- pd
pd2 <- pData(cs)  # Patient is now character, levels lost
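
The loss can be reproduced without flowWorkspace. The following is a minimal base-R sketch, assuming the backend behaves like a plain string store; the `stored` list and sample names are hypothetical stand-ins, not the actual C++ interface:

```r
# Toy stand-in for a string-only metadata store (not the flowWorkspace API).
pd <- data.frame(
  Patient = factor(c("A", "B"), levels = c("C", "B", "A")),
  row.names = c("s1.fcs", "s2.fcs")
)

# Writing: every column is flattened to character, as a string map would require.
stored <- lapply(pd, as.character)

# Reading: the data.frame is rebuilt from strings alone.
pd2 <- data.frame(stored, row.names = rownames(pd), stringsAsFactors = FALSE)

is.factor(pd2$Patient)       # FALSE: the column is now character
levels(factor(pd2$Patient))  # "A" "B": re-factoring sorts alphabetically, "C" is gone
```

Re-applying `factor()` after the round trip cannot recover the unused level "C" or the custom ordering, which is why the levels have to be kept somewhere on the R side.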

Solution: Extended Classes

New Classes

  1. cytoset_factors - Extends cytoset with factor preservation
  2. GatingSet_factors - Extends GatingSet with factor preservation

Design

  • Isolation: Changes are in new files (cytoset_factors.R, GatingSet_factors.R)
  • Non-intrusive: Original cytoset and GatingSet classes unchanged
  • Opt-in: Users must explicitly create/convert to *_factors classes
  • Dedicated Slot: Uses a new factor_data slot (data.frame) to store pData with factors
  • Dual Storage: Keeps R factors in factor_data and syncs string representations to C++ backend
  • Transparent: All existing methods work, just with factor preservation

Architecture

┌─────────────────┐           ┌──────────────────────┐
│ factor_data     │◄──────────│  cytoset_factors     │
│ (data.frame)    │           │  (new slot)          │
└─────────────────┘           └──────────────────────┘
                                      ▲
                                      │ inherits
                                      │
  ┌────────────┐              ┌─────────────────┐
  │  cytoset   │◄─────────────│  pData methods  │
  │  (C++ ptr) │              │  (read local)   │
  └────────────┘              │  (write both)   │
        │                     └─────────────────┘
        ▼
  C++ PDATA (strings only)
  
Factor levels stored in:
  object@factor_data
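
The "read local, write both" pattern in the diagram can be sketched with toy S4 classes. This is only an illustration under stated assumptions: the `toy_*` classes and generics are hypothetical, and an R environment stands in for the C++ pointer; it is not the code in this PR.

```r
library(methods)

# Hypothetical stand-in for the C++ backend: an environment holding strings only.
setClass("toy_cytoset", representation(pointer = "environment"))

# Extended class adds an R-side slot that keeps the factor columns intact.
setClass("toy_cytoset_factors", contains = "toy_cytoset",
         representation(factor_data = "data.frame"))

setGeneric("toy_pData", function(object) standardGeneric("toy_pData"))
setGeneric("toy_pData<-", function(object, value) standardGeneric("toy_pData<-"))

# Base class: read whatever the string backend holds.
setMethod("toy_pData", "toy_cytoset", function(object) object@pointer$pd)

# Extended class: read from the local slot, so factor levels survive.
setMethod("toy_pData", "toy_cytoset_factors", function(object) object@factor_data)

# Extended class: write both copies.
setReplaceMethod("toy_pData", c("toy_cytoset_factors", "data.frame"),
  function(object, value) {
    object@factor_data <- value                # R side keeps factors
    strings <- value
    strings[] <- lapply(value, as.character)   # backend gets strings only
    object@pointer$pd <- strings
    object
  })

cs_f <- new("toy_cytoset_factors", pointer = new.env())
pd <- data.frame(Patient = factor(c("A", "B"), levels = c("C", "B", "A")))
toy_pData(cs_f) <- pd

levels(toy_pData(cs_f)$Patient)                        # "C" "B" "A"
class(toy_pData(as(cs_f, "toy_cytoset"))$Patient)      # "character"
```

Viewing the same object as its parent class shows the string-only copy, which is what code unaware of the extended class would see.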

…levels in pData by delegating to flowSet's phenoData slot
@mikejiang mikejiang requested a review from djhammill February 19, 2026 21:54
@mikejiang
Member Author

Usage

Basic Usage

library(flowWorkspace)

# Load data and convert to cytoset_factors
cs <- load_cytoset_from_fcs(...)
cs_f <- cytoset_factors(cs)

# Create pData with custom factor levels
pd <- pData(cs_f)
pd$Patient <- factor(pd$Patient, levels = c("C", "B", "A"))
pd$Visit <- factor(pd$Visit, levels = c("V3", "V2", "V1"))

# Assign - factors are now preserved in the factor_data slot!
# Strings are also synced to C++ for compatibility.
pData(cs_f) <- pd

# Retrieve - factors restored with correct levels from factor_data
pd2 <- pData(cs_f)
stopifnot(is.factor(pd2$Patient))
stopifnot(identical(levels(pd2$Patient), c("C", "B", "A")))

With GatingSet

gs <- load_gs(...)
gs_f <- GatingSet_factors(gs)

pd <- pData(gs_f)
pd$Batch <- factor(pd$Batch, levels = c("Batch3", "Batch2", "Batch1"))
pData(gs_f) <- pd

# Factors preserved in GatingSet
pd2 <- pData(gs_f)
stopifnot(is.factor(pd2$Batch))

With ggcyto (visualization)

# ggcyto automatically respects factor ordering from pData
cs_f <- cytoset_factors(cs)
pd <- pData(cs_f)
pd$Patient <- factor(pd$Patient, levels = c("Control", "Treatment"))
pData(cs_f) <- pd

# Plots will respect factor order
ggcyto(cs_f, aes(x = "CD4")) + 
  geom_histogram() + 
  facet_wrap(~Patient)  # Facets ordered: Control, then Treatment

@DillonHammill
Contributor

Thanks for taking a look @mikejiang. I've been thinking about this and I have a new proposal that avoids the need for extended classes.

For flowSet objects we store the metadata in the phenoData slot. When we access the metadata we are actually calling pData(fs@phenoData) under the hood, which converts the AnnotatedDataFrame to a data.frame and preserves all data.frame properties, including column classes. This is why it worked just fine for flowSets.

For cytosets, we instead store the metadata on the C++ side as character strings, which are accessed using pData(cs). The cytoset object also has the phenoData slot, but it is currently empty.

cs@phenoData
An object of class 'AnnotatedDataFrame': none

So we could simply store the metadata on the R side and return that to the user when pData() is called.

# c++ metadata
pd <- pData(cs)

# convert to AnnotatedDataFrame for phenoData slot
adf <- Biobase::AnnotatedDataFrame(pd)

# set factor levels
adf$Treatment <- factor(adf$Treatment)

# update phenoData slot
cs@phenoData <- adf

# factors are preserved
pData(cs@phenoData)$Treatment
 [1] Stim-A Stim-A Stim-A Stim-A Stim-A Stim-A Stim-A Stim-A Stim-B Stim-B
[11] Stim-B Stim-B Stim-B Stim-B Stim-B Stim-B Stim-C Stim-C Stim-C Stim-C
[21] Stim-C Stim-C Stim-C Stim-C Stim-D Stim-D Stim-D Stim-D Stim-D Stim-D
[31] Stim-D Stim-D NA    
Levels: Stim-A Stim-B Stim-C Stim-D NA

So all we would need to do is make the pData accessor for cytosets return the R-side metadata, and update the replacement methods to update the R-side metadata first and then the C++-side metadata.

# pData method on cytoset extracts R side metadata
#' @export 
setMethod("pData",
          signature=signature(object="cytoset"),
          definition=function(object) {
            # old method extracts C++ side metadata
            # pd = get_pheno_data(object@pointer)
            # instead extract R side metadata
            pd <- Biobase::pData(object@phenoData)
            # cn <- names(pd)
            # names(pd) <- cn[cn!=""]
            pd
          })

# pData replacement method updates R side metadata first
#' @export 
setReplaceMethod("pData",
                 signature=signature(object="cytoset",
                                     value="data.frame"),
                 definition=function(object,value)
                 {
                   # update R side metadata first
                   object@phenoData <- Biobase::AnnotatedDataFrame(value)
                   # convert to characters for C++ side metadata
                   for(i in seq_along(value))
                     value[[i]] <- as.character(value[[i]])
                   set_pheno_data(object@pointer, value)
                   object
                 })

# no changes required for GatingSet pData method
setMethod("pData","GatingSet",function(object){
  # this will now extract R side metadata automatically
  pData(gs_cyto_data(object))
})

# no changes required for GatingSet pData replacement method
setReplaceMethod("pData",c("GatingSet","data.frame"),function(object,value){
  
  fs <- gs_cyto_data(object)
  new.rownames <- rownames(value)
  if(is.null(new.rownames))
    new.rownames <- value[["name"]] #use name column when rownames are absent
  
  rownames(value) <- new.rownames
  
  # this will automatically update R side metadata then store on C++ side 
  pData(fs) <- value
  
  return (object)
})

@mikejiang
Member Author

We don't want to change the current behavior of cs/gs, which are meant to be pure thin wrappers around the underlying C++ structure.
Attaching additional R-side state will cause confusion down the road, because users may wonder why information is lost when saving and loading these objects.
This is why I deliberately created the dedicated classes: to signal that they are special and designed for a transient lifecycle in an active R session.
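
A toy illustration of the save/load concern, assuming only the string copy is persisted; the tab-delimited temp file is a hypothetical stand-in for the on-disk format, not how flowWorkspace actually saves objects:

```r
# R-side copy with custom levels, plus the string-only copy a backend would persist.
r_side <- data.frame(Batch = factor(c("Batch1", "Batch3"),
                                    levels = c("Batch3", "Batch2", "Batch1")))
backend <- data.frame(Batch = as.character(r_side$Batch), stringsAsFactors = FALSE)

# Persist only the backend copy.
path <- tempfile(fileext = ".tsv")
write.table(backend, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Reload as a new session would: only strings come back.
reloaded <- read.delim(path, stringsAsFactors = FALSE)

identical(reloaded$Batch, as.character(r_side$Batch))  # TRUE: values survive
is.factor(reloaded$Batch)                              # FALSE: levels and ordering do not
```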

