Today I’m happy to share that Diffgram Workflow version one is now available! This is a major step forward.
Workflow is a powerful new way to work with:
- Training Data,
- Data Science modeling,
- Dataset testing, and more;
all included as part of core Diffgram.
All Training Data projects face challenges like:
- Getting new processes set up for human tasks, imports, exports, etc.
- Keeping everyone on the same page as those processes change
- Integrating with a variety of technologies in the ecosystem, for data cleaning, model training, predictions, reports etc.
Workflow solves that.
With Workflow you can define your Training Data processes in software, keep multiple stakeholders on the same page, and integrate with a growing ecosystem of tech.
Table of Contents
This article covers the following topics:
- The Big Ideas
- Included at No Extra Cost
- The Global Backdrop
- HuggingFace Implemented Example
- Workflow explained by contrast
- Beyond Modeling
- The Thought Process Behind Workflow
- UI for Everyone — Not just engineers
- Workflow Technicals
- Contrasting Workflow
- Implemented as of Today
Let’s dive in!
Conceptually, Workflow acts like a Training Data specific Orchestrator.
You line up building blocks like Tasks, Model Training, PreLabeling, Data Cleaning, etc. to work together smoothly.
Zooming in, Workflow replaces what would have been HTTP calls to a one-off script with customizable blocks, configured directly in Diffgram.
We will keep expanding the pre-implemented blocks, and continue to make it easier and easier for you to add your own.
Over time I believe this will be a fraction of the effort of standing up and maintaining your own external service, and provide a massively better experience for yourself, your admins, annotators, and end users.
Basically before it was “figure it out yourself”. Now there’s a super clear path to go from “zero to hero” with training data, including pre-labeling, model training, sampling, etc.
We understand some competitors have chosen the route of selling Modeling (one of many parts of Workflow) as a separate service.
We have included Workflow at no extra cost as part of core Diffgram.
That means installations under the Open Source license get Workflow for free, and Enterprise users may request prioritized improvements to Workflow.
There are many new dedicated teams and groups creating technology blocks for specific Training Data problems.
There continues to be an expanding volume of use cases, media types, methods, etc. The rising tide of these teams working together is better than any single implementation.
Part of what prompted this was seeing capable teams spend too many cycles getting relatively basic setups working. We realized the need for a dedicated, named concept, Workflow, that wraps around Diffgram subcomponents and external technologies and weaves them together as one.
Here’s an example of a pre-implemented block. In a few minutes you can be up and running with HuggingFace Zero Shot pre-labeling your data and humans reviewing the tasks. Fully integrated.
Consider having to implement Huggingface Zero Shot yourself.
How would you do it? 🤔
Well, without Workflow you would have to do some variation of the following:
- Implement some form of ingestion or data collection to get the data to your script
- Define and load your Schema
- Implement the actual Huggingface script
- Store the data somewhere
- Integrate with your Training Data system to send that data for human review
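To make the do-it-yourself route concrete, here is a rough sketch of just the pre-labeling step. This is an illustration, not Diffgram code: `prelabel`, `load_zero_shot_classifier`, and the 0.5 review threshold are hypothetical names and values picked for the example. The only real library call is Hugging Face's `pipeline("zero-shot-classification")`, which returns a dict whose `labels` and `scores` are sorted best-first. It is imported lazily so the rest of the sketch runs without the library installed.

```python
from typing import Callable, Dict, List


def load_zero_shot_classifier() -> Callable:
    """Load the Hugging Face zero-shot pipeline (downloads a model on first use)."""
    from transformers import pipeline  # lazy import; requires `pip install transformers`
    return pipeline("zero-shot-classification")


def prelabel(texts: List[str], schema_labels: List[str], classifier: Callable,
             review_threshold: float = 0.5) -> List[Dict]:
    """Produce one pre-label record per text.

    `classifier` must accept (text, candidate_labels=...) and return a dict
    with "labels" and "scores" sorted best-first, as the Hugging Face
    zero-shot pipeline does.
    """
    records = []
    for text in texts:
        result = classifier(text, candidate_labels=schema_labels)
        top_label, top_score = result["labels"][0], result["scores"][0]
        records.append({
            "text": text,
            "label": top_label,
            "score": top_score,
            # Hypothetical rule: low-confidence predictions go to human review.
            "needs_review": top_score < review_threshold,
        })
    return records
```

Even this short version leaves ingestion, schema loading, storage, and the hand-off to human review as separate pieces you would still have to build and maintain.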
After all that, problems remain:
- How would you share the original script with your teammates?
- If a non-technical person changed the Schema, how would you know?
- When Hugging Face comes out with a new method, how would you upgrade it?
- How do you report on the progress of this?
- How do you know when the script is done running?
And that’s really just the tip of the iceberg.
Whereas with Diffgram Workflow, you are reusing existing concepts:
- Workflow triggers based on events, such as new data being loaded, tasks being completed, and yes, you can still manually trigger them too.
- All the existing import concepts and connection methods are available (pass by reference for stored data, for example).
- Workflow loads the Schema automatically. Non-technical users can manage the schema, and even use multiple Schemas.
- Workflow has training data technologies already implemented (Huggingface Zero Shot in this case, many more coming).
- Workflow stores the data as Diffgram Annotations, which is queryable by Diffgram Catalog. Basically you get the storage and query integration “for free”.
- Workflow integrates smoothly with next steps, like creating human tasks.
- Workflow is defined in software and overall flow is clearly surfaced and visible to non-technical users.
- Workflow even has integrated reporting.
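The idea of blocks that run when an event arrives can be sketched in a few lines. This is a conceptual toy under stated assumptions, not Diffgram's implementation or API: `Workflow`, `EventRouter`, the two placeholder blocks, and the event name `input_file_uploaded` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Workflow:
    """A named sequence of blocks started by one kind of event (hypothetical model)."""
    name: str
    trigger: str  # hypothetical event name, e.g. "input_file_uploaded" or "manual"
    blocks: List[Callable[[Dict], Dict]] = field(default_factory=list)

    def run(self, payload: Dict) -> Dict:
        # Each block receives the previous block's output.
        for block in self.blocks:
            payload = block(payload)
        return payload


class EventRouter:
    """Starts every registered workflow whose trigger matches an incoming event."""

    def __init__(self) -> None:
        self._workflows: List[Workflow] = []

    def register(self, workflow: Workflow) -> None:
        self._workflows.append(workflow)

    def emit(self, event_name: str, payload: Dict) -> List[Dict]:
        # Copy the payload so workflows cannot mutate each other's input.
        return [wf.run(dict(payload)) for wf in self._workflows if wf.trigger == event_name]


def zero_shot_block(payload: Dict) -> Dict:
    # Placeholder for a pre-labeling step; a real block would call a model.
    return {**payload, "prelabel": "placeholder"}


def create_task_block(payload: Dict) -> Dict:
    # Placeholder for handing the pre-labeled file to human review.
    return {**payload, "task_created": True}


router = EventRouter()
router.register(Workflow("prelabel-then-review", "input_file_uploaded",
                         [zero_shot_block, create_task_block]))
results = router.emit("input_file_uploaded", {"file_id": 1})
ignored = router.emit("some_other_event", {"file_id": 2})
```

The point of the sketch is the shape: the trigger, the ordering of blocks, and the hand-off between them live in one declared place rather than being scattered across scripts.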
Workflow turns what would have been days or even weeks of upfront work, plus an unknown amount of maintenance, into a few minutes.
And in cases where a pre-built block doesn’t exist, Workflow still saves the vast majority of the effort: you only need to write the concrete implementation of the method, and you can lean on all the other existing concepts.
Deepchecks is an open source library that allows you to test and analyze both your model’s performance and your data quality.
Deepchecks offers several tools and utilities to better monitor the performance of your machine learning models. Check out Deepchecks in Diffgram Workflow.
To illustrate the point that Workflow is UI centric, consider this Tasks step in workflow.
On the left is the configuration screen. On the right is the “active” screen.
The “active” screen is actually the entire normal Tasks module. That’s right, you can use the workflow step just as you would use Tasks.
So for example if I click the Insights tab I can manage that workflow directly.
This may seem fairly obvious, but it’s a big step forward in terms of surfacing these processes, and being able to work with them directly.
When working with training data there are a lot of interesting interplays between people and technology.
For example, a human review task is done one at a time.
However, you may not want to train a model until a group of tasks is complete. Often the first reaction here, on the engineering side, is to think in terms of “pipelines”. Pipelines come with a lot of assumptions:
- They require a highly technical audience to set up
- They are set up infrequently, then run
- They are hard for a non-technical audience to engage with meaningfully, even when pre-configured
Workflow is different: it is built for Training Data from the ground up.
After some reflection we realized a few things:
- Workflow needed to be understandable by a project admin, not just engineering
- Each block may be interactive if applicable: blocks are more than “nodes”.
- Each block must be able to consider relevant conditional and aggregate concepts as events roll in.
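The “wait until a group of tasks is complete” case is a good example of an aggregate condition evaluated as events roll in. A minimal sketch, assuming a hypothetical helper named `TaskGroupGate` that is not part of Diffgram:

```python
from typing import Iterable, Set


class TaskGroupGate:
    """Opens once every task in a group has reported completion (hypothetical helper)."""

    def __init__(self, task_ids: Iterable[int]) -> None:
        self._pending: Set[int] = set(task_ids)
        self._fired = False

    def on_task_completed(self, task_id: int) -> bool:
        """Record one completion event.

        Returns True only for the event that completes the group, so the
        downstream step (for example, model training) starts exactly once.
        """
        self._pending.discard(task_id)  # unknown or repeated ids are ignored
        if not self._pending and not self._fired:
            self._fired = True
            return True
        return False
```

Each human task still completes one at a time, but the training step only starts when the gate opens.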
The way these workflows get created is very dynamic. For example a data scientist may want to query and explore some data, then move it around, compare things etc. A project admin may want multiple new sets. This isn’t just a one time thing by a pure engineering team.
Workflow is built to scale. Workflow is built to be expandable and maintainable.
- Workflow is backed by a dedicated queue service that uses any AMQP-compatible provider, with our reference installation using RabbitMQ.
- Workflow can deploy with separate resources from the rest of Diffgram, as specified in the Helm chart. For example, you can assign GPU resources to Workflow.
- Workflow can still call external providers, so tasks that require specific resources or scaling parameters outside the scope of the Workflow service can still be used.
- Workflow leans on the Diffgram Connections paradigm. This means connections to third parties (Google, AWS, Azure, etc.) are available to Workflow.
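To show why a queue sits between the things that produce events and the service that runs blocks, here is a small in-process sketch. Python's standard `queue.Queue` stands in for an AMQP broker; this is not how Diffgram talks to its broker, and `run_worker`, `process_all`, and the `file_id` field are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List

STOP = object()  # sentinel telling the worker to shut down


def run_worker(events: "queue.Queue", handle: Callable[[Dict], None]) -> None:
    """Consume events one at a time until the STOP sentinel arrives."""
    while True:
        event = events.get()
        if event is STOP:
            break
        handle(event)


def process_all(incoming: List[Dict]) -> List[int]:
    """Publish events to the queue and let a separate worker thread handle them."""
    events: "queue.Queue" = queue.Queue()
    handled: List[int] = []
    worker = threading.Thread(
        target=run_worker,
        args=(events, lambda e: handled.append(e["file_id"])),
    )
    worker.start()
    for event in incoming:  # the producer does not wait for handling to finish
        events.put(event)
    events.put(STOP)
    worker.join()  # wait for the worker to drain the queue
    return handled
```

Because the producer only enqueues, slow work such as model inference does not block whatever emitted the event, and the worker side can be given its own resources.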
Conceptually, we are continuing to make it easier and easier to integrate and create new Actions. This is a much stronger level of integration than the SDK, as database queries and writes can be made directly, and assumptions around event triggers etc. are much stronger.
Over time we picture that Workflow will be:
- Powerful enough out of the box to use as Internal “Apps”
- Customizable to work with any of your processes that surface to External apps shipped by your team
Replaces Manual Scripts
Workflow replaces what would have been totally manual, or a hodgepodge of scripts cobbled together.
Works with Other Orchestrators
Existing ML/AI orchestration flows and general purpose pipelines are complementary to Diffgram Workflow.
Better than Implementing a Fit() Function Yourself
- Pre-built blocks for data science modeling.
- Composable ways to build your own: Workflow has an array of built-in concepts, including UI selectors, an event system, Schema, Auth, and much more.
- Goes far beyond just training a model, with debugging and other training data tech. Workflow surfaces processes to non-technical users. It’s so much more than just the technical side; it’s a new way to work with your training data and bring all of your team together.
Add-On Model Training Products
Workflow is more flexible, more cost effective, and provides more value than Labelbox Model or SuperAnnotate consulting.
- More flexible, works with more methods. Instead of vendor lock-in, explore multiple training data technologies (including AutoML / model training), and easily switch as new ones become available.
- Workflow is more cost effective because you use your own hardware (e.g. with open source models) and preferred third-party providers. No extra consulting layer. And Workflow itself comes at no additional cost; it’s included as part of core Diffgram.
- Workflow goes beyond the model, as mentioned in the Fit() explanation. Workflow surfaces the overall business process itself and makes those processes defined and accessible to non-technical and remote users. It’s not just about literally training a model and getting predictions, but all of the work around it.
Workflow helps answer questions like: What schema is being used? Who is working on it? What does the overall flow look like? As a non-technical user, how do I change the Schema of a technical integration someone else set up?
Already implemented are:
- Human Tasks
- HuggingFace Zero Shot
- Webhook Events
- Google VertexAI Model Training* (coming very soon)
- V1 of CLI for creating your own actions.
The first step is to install Diffgram.
We welcome improvement ideas, pull requests, and more. Join our GitHub and Slack community.
As part of the Enterprise license you can get access to services that include:
- Prioritizing integrations with your desired technology (e.g. Azure, AWS, ABC Vendor, etc.)
- Helping your team build their own Actions
- Prioritizing relevant feature requests to improve Actions
- Unlimited use of Workflow through the commercial Enterprise License
Thank you for reading!