February Community Newsletter – Data Engineering: Get to the Crux

Hello from the CEO

Philip Brittan, Crux CEO

Delightful data is useful and usable. Data Scientists make data useful through analysis that extracts valuable insights from the data. But first, Data Engineers make that data usable by whipping it into shape: loading it, cleaning it, normalizing it, mapping it, joining it, and performing other transformations that get the data ready for Data Scientists to wring value out of it.

While Data Science gets the headlines, Data Engineering works hard behind the scenes to make the Data Science magic possible. And by working hard, I mean that Data Engineering typically accounts for 70-80% of the total effort a firm spends on making use of data. Data Science and the unique insights it delivers are business differentiators, yet most firms can devote only a minority of their effort to them.

That’s why forward-looking companies increasingly turn to a partner like Crux. By offloading their Data Engineering work, these companies give more time and energy to Data Science and move much more quickly to produce valuable new insights that power their businesses.

Crux brings laser focus, deep expertise, operational oversight, and a valuable network of data suppliers to help you orchestrate, implement, and operate your information supply chain.

At Crux, we make data delightful.


Crux Insights Blog

How can you keep the right data flowing into your business? It’s simple: Orchestrate, Implement, and Operate. Read about Crux’s three-step process in our latest blog post.



Five in 5 with Head of Data Engineering Andrew Clark

Andrew Clark is Crux’s Head of Data Engineering, and he has a tall order to fill. At 6’6” he sees the full spectrum of data needs for Crux clients. With deep experience managing unstructured data, he’s a master of data transportation, storage, and repackaging. Here are five questions in 5 minutes with Andrew:


What does a data engineer do?
At Crux, being a data engineer means handling the tough work that makes data more actionable for our clients, and designing the tools that make our clients’ lives easier over time. Data engineers sit on the “data wrangling” side of the pipeline, meaning we are the folks who handle the hard work of figuring out where certain elements of the dataset live, slicing and dicing data, and repackaging it for distribution.


How has the data engineering landscape changed in the last 5 years?
Today, the folks managing information supply chains are embracing the fact that the whole process does not need to exist on-premises anymore. While firms used to believe their data engineering was their “secret sauce”, today they realize it’s the insights they can glean that are more important. Using experts like Crux to remove as much of the tedious, upfront work as possible is now the preferred model.


What are you most excited about?
At Crux, we’re illustrating the art of the possible for our clients. What was once difficult has now become easy. Helping clients realize the full potential of their data is truly exciting.


What do you do when you are not engineering data?
I am a big outdoorsman, so my favorite activities tend to be outside. I am an avid cyclist, I ride a motorcycle, and I am currently building an airplane.


What would geolocation data tell us about you?
If you were to assess my geolocation data, you’d probably find that when I am not working, I like to go to places where the population density is low. This means you’ll probably find me on my bicycle, hiking, or somewhere outdoors and away from the city.


Crux Community

Is it difficult to get access to usable data? Let Crux experts engineer your data to make it ready to use. Our data engineers take on your data challenges so that you can spend your time finding signals. Click HERE to chat with our team of experts.

Have data to share? Our data supplier community is growing by leaps and bounds.  Our diverse datasets range from stock quotes to corporate trends to transportation data and more.  No data is irrelevant.  Create a Crux login HERE to browse our network and become a supplier.

Out and About

We’ve been building our community. In the past month, we’ve met with hundreds of suppliers and buyers of alternative data.


Quandl Alternative Data Conference | January 18, 2018
New York, NY


Battlefin Discovery Day Miami | January 30-31, 2018
Miami, FL


Outsell Data Money | February 1, 2018
New York, NY


AI in Fintech Forum | February 8, 2018
Stanford University, Stanford, CA

Orchestrate, Implement, Operate


In my last blog post, I talked about how Informatics firms help companies ‘orchestrate, implement, and operate’ their information supply chains. What exactly do I mean by that? As an Informatics firm, this is what Crux does:


Orchestrate: ‘Orchestrating’ means pulling together and coordinating a variety of components so that they work together effectively, the way the conductor of an orchestra makes sure the individual musicians play together to bring the music to life. The first step in creating a supply chain is deciding which elements need to go into it. This is driven by the use case of the consuming customer (hedge fund, bank, insurance company, etc.): what data do they need, and in what form do they need it?

Crux works with a supplier network of partners: data publishers, analytics firms, and service providers who form the components of the supply chain that Crux implements and operates. In some cases, a consumer has a specific dataset or vendor they know they want to work with. In other cases, the consumer knows only the type of data they want, and they look to Crux to help them surface potential providers of that data and possibly to run tests on candidate datasets to objectively assess the fitness of that data for the customer’s use case.

Crux also works with a wide range of tools and third-party service providers and pulls them into the appropriate set to meet the needs of the specific supply chain. For instance, there may be a specialist who transforms the data in some specific way (akin to a ‘refiner’ in my last blog post). Crux partners can make themselves visible to clients on the Crux platform, so that customers can browse and learn about specific datasets, analytics, and services, get inspired, and express interest in exploring any of them more deeply.


Importantly, Crux does not sell or resell any data or analytics itself. Producers and consumers can count on Crux being an objective, neutral partner, and producers have full control over where their data goes. Providers license their content directly to customers, and Crux acts as a third-party facilitator to wire up and watch over the data pipelines on behalf of customers, as described below.


Implement: A supply chain fundamentally involves the flow of goods from producer to consumer. In the physical-goods world (traditional logistics), that involves transportation, storage, and (potentially) repackaging. In the case of an information supply chain, it involves the transportation, storage, and repackaging of data. These are the fundamental data engineering tasks that allow data to flow between parties in a way that is maximally actionable for the consumer. They generally involve writing software that ingests the data (picking up FTP files, copying from an entitled S3 bucket, legally scraping a web site, hitting an API, etc.), validates it (looking for missing, unrecognizable, or erroneous data), structures it (usually into one or more database tables), cleans it, normalizes it, transforms it, enriches it, maps embedded identifiers, joins it with other data, removes duplicate entries, and so on, all to support the specific use case of the customer. This is the kind of data engineering work Crux does to implement a specific supply chain for a customer, pulling in the appropriate data providers, tools, and value-added service providers identified in the Orchestration phase.
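As a rough illustration, the ingest, validate, normalize, and de-duplicate steps described above can be sketched in a few lines of Python. Everything here is hypothetical: the field names, the validation rule, and the de-duplication key are stand-ins for whatever a real customer’s use case would demand.

```python
import csv
import io

def ingest_validate_normalize(raw_csv: str) -> list:
    """Toy pipeline stage: parse, validate, normalize, and de-duplicate rows.

    Illustrative only -- field names and rules are hypothetical.
    """
    rows = csv.DictReader(io.StringIO(raw_csv))
    clean, seen = [], set()
    for row in rows:
        # Validate: skip rows with a missing or non-numeric price.
        price = (row.get("price") or "").strip()
        if not price.replace(".", "", 1).isdigit():
            continue
        # Normalize: uppercase the ticker, cast the price to float.
        key = row["ticker"].strip().upper()
        # De-duplicate on the normalized ticker.
        if key in seen:
            continue
        seen.add(key)
        clean.append({"ticker": key, "price": float(price)})
    return clean

raw = """ticker,price
aapl,172.5
AAPL,172.5
msft,
goog,135.2
"""
print(ingest_validate_normalize(raw))
# → [{'ticker': 'AAPL', 'price': 172.5}, {'ticker': 'GOOG', 'price': 135.2}]
```

In a real supply chain each of these steps would be its own monitored stage, but even this sketch shows why the work is labor-intensive: every rule is specific to one dataset and one use case.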


Operate: Rarely is a dataset static. The vast majority of datasets receive regular updates, whether once a month or once per millisecond. As that data flows, constant vigilance is needed to make sure it shows up when it is supposed to, that it is not missing anything, and that it does not contain unidentifiable components. Data Operations encompasses the monitoring and remediation of ongoing data streams. Crux Data Operators set up dashboards and alerts to keep a close eye on data in motion and all the systems it travels through. When a problem is spotted, they immediately begin diagnosing and remediating the issue, in tight collaboration with the relevant data provider(s), to get ahead of it before it affects downstream consumers. Data Operations also includes standard maintenance tasks, such as watching for and reacting to data specification changes and scheduled maintenance outages coming from the data provider(s).
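A toy version of such an alert check might look like the following. The thresholds, intervals, and feed statistics are all invented for illustration; a production Data Operations setup would live in real monitoring infrastructure rather than a single function.

```python
from datetime import datetime, timedelta

def check_feed(last_arrival, expected_interval, row_count, typical_rows,
               now=None, tolerance=0.5):
    """Toy data-operations check (hypothetical thresholds): flag a feed
    as late or anomalous so an operator can begin diagnosis."""
    now = now or datetime.now()
    alerts = []
    # Timeliness: did the update show up when it was supposed to?
    if now - last_arrival > expected_interval:
        alerts.append("LATE: no update within expected interval")
    # Completeness: is the volume wildly off from the norm?
    if typical_rows and abs(row_count - typical_rows) / typical_rows > tolerance:
        alerts.append(f"VOLUME: row count deviates more than {tolerance:.0%} from typical")
    return alerts

now = datetime(2018, 2, 1, 12, 0)
alerts = check_feed(
    last_arrival=datetime(2018, 2, 1, 9, 0),  # feed last seen three hours ago
    expected_interval=timedelta(hours=1),      # but it is expected hourly
    row_count=100, typical_rows=10_000,        # and today's file looks tiny
    now=now,
)
print(alerts)
# → ['LATE: no update within expected interval', 'VOLUME: row count deviates more than 50% from typical']
```

The point of the sketch is the shape of the job: simple, continuously evaluated checks that turn silent data failures into actionable alerts before a downstream consumer notices.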


These are the key elements of Information Supply Chain Logistics in a nutshell.  It is a rich process and gives customers tremendous leverage in harnessing the integrated value of a network of suppliers.


Contact Crux if you’d like to learn more.

Informatics Firms and Information Supply Chains

Philip Brittan, CEO of Crux Informatics, Inc.

One of the most revolutionary steps in the evolution of manufacturing has been the emergence of sophisticated supply chains. To understand them, first imagine how a person or a firm could create a new product by gathering raw materials and making all the parts themselves. Then imagine how pieces of that process are picked up by others who specialize in various ingredients that go into creating the finished products, such as raw materials providers, tools makers, and (eventually) component manufacturers who create standardized subsets of a product that can be assembled by multiple downstream firms to produce different end products.

Over time the raw materials become more refined (planed lumber instead of timber, jet fuel instead of crude oil, steel instead of iron), and the refiners may in fact be separate companies in the supply chain who take in raw materials and output refined materials, perhaps in several steps by several companies. Over time, tools become more sophisticated and specialized, consuming materials and tools from their own supply chains. Components become increasingly complex and comprehensive (producing larger assemblages), again consuming materials, tools, and possibly sub-components from upstream. With this evolution, manufacturing supply chains have become exceedingly sophisticated and complex, with literally thousands of companies working together to build a car, for example.

One of the key innovations needed to enable this is standards. Thanks to accepted and widely used industry standards, a screw manufacturer can specialize in making screws for a large number of downstream firms, without each screw being a bespoke project. That specialization and focus, and the automation that becomes possible when manufacturing standardized components, drive economies of scale and advances in efficiency.

Along with physical goods, supply chains eventually also come to encompass value-added services, such as consulting, metrics gathering, supplier ratings, etc. A special kind of service provider associated with supply chains is the Logistics company. The Wikipedia definition of supply-chain logistics explains that “logistics is the management of the flow of things between the point of origin and the point of consumption in order to meet requirements of customers”. Logistics firms help companies orchestrate, implement, and operate their complex supply chains. They generally work with a network of suppliers that they can bring to bear when helping a firm set up a supply chain. And they have the skills and tools to make sure that the supply chains are operating smoothly, which in the physical goods world frequently involves planning and arranging efficient transportation and storage.

In information intensive industries, such as financial services, processing information to drive valuable insights is the core “manufacturing process”. For example, financial firms of all kinds—banks, hedge funds, research houses, private equity firms, insurance companies, etc.—all take in relevant information about the world, process and perform analysis on that information, drive insights, and take action on those insights. That action can take many forms—make a loan, place a trade, rebalance a portfolio, pitch a client, author a research report, buy a company, underwrite a policy, etc, depending on the type of firm—but all firms have at their core that critical process of gathering information and performing analysis to drive insight.

Over time, the range of information that firms utilize in this core process has grown in volume, velocity, and variety. As such, firms have started to move beyond simply collecting raw material (data), to thinking about their information supply chains, an evolution that closely mirrors what we have seen in manufacturing industries. We are witnessing rapid evolution in the tools that are available to companies to process and analyze data. And a large variety of suppliers, in the form of ‘alternative’ data vendors, have sprung up to meet the ever-expanding needs of financial firms to feed their insight generation processes. One interesting feature of information supply chains is that they may be looping, meaning company A may produce some data (perhaps exhaust from a trading system), feed it to one or more refiners, aggregators, or derived-data producers, who then feed their output back to company A to use in their analytics.

These information supply chains are getting more complex and thus harder to manage, yet, to date, financial firms have generally managed them themselves. This has led to inefficiencies and redundancies across the industry. Every firm has had to become at least basically competent in data management, and many have built some form of in-house platform (some well, some poorly) to help manage their data flows. We are left with a situation where hundreds (in some cases thousands) of firms are wiring up to the same sources of data, downloading the same data, storing the same data, cleaning the same data, mapping the same data, and so on, independently, redundantly, with no economies of scale.

Just as Logistics firms arose to help manufacturing firms manage their increasingly complex and burdensome supply chains, a new type of firm, the Informatics firm, is an inevitable evolution of the market to help companies manage their information supply chains. Informatics firms help companies discover relevant sources of data and evaluate that data for fitness to the needs of the firm. They implement and operate the data processing pipelines needed to get information from the supplier to the customer, while validating, cleaning, transforming, mapping, and enriching the data along the way (what we might call Data Engineering) so that it arrives at the customer in a form that is immediately actionable, meaning the firm can do something with it that is pertinent to its business (what we might call Data Science), as is, without requiring further refinement. With a supply chain mentality, Informatics firms pull in the right tools and partners to get the job done.

In effect, Informatics firms ‘manage the flow of information between the point of origin and the point of consumption in order to meet requirements of customers’. Informatics firms can bring economies of scale to the industry by wiring up to a specific source of data once, storing that data once, cleaning that data once, mapping that data once, on behalf of many clients, who can share the costs of those things rather than bearing them independently and redundantly. Informatics firms can also help with the broad implementation of industry standards, which allows for more automation and greater efficiency for everyone.

Firms in information-driven industries, such as financial services, need to think of their core data and analytics workflow as their ‘manufacturing’ process and they need to think about the content that feeds that process as their critical supply chain. As they do so, Informatics firms can help them orchestrate, implement, and operate those supply chains more effectively and efficiently.