I think you may have a fundamental misunderstanding of the value Snowflake brings to the Data Lake use case. There is a reason Snowflake no longer calls itself a "Cloud Data Warehouse": the term is overloaded and can confuse people about the workloads Snowflake can take on. Some of the earliest and largest Snowflake wins were really Data Lake use cases. Companies with massive amounts of semi-structured files struggled to query them at scale using Hadoop, and Snowflake came along and made it easy. If you have a million JSON or Parquet files and want to query them performantly with SQL, name a better technology than Snowflake to do that today. I'd frame the difference this way: Snowflake lets you meet Data Lake and Data Warehouse use cases in the SAME TECHNOLOGY. Using SQL. Without having to create indexes or do performance tuning. The only difference is that Snowflake gleans statistics about the files as it ingests them, so you can query them as-is. Describing this as "two tiers" is fundamentally inaccurate--there is no logical difference between a company "loading" a semi-structured file into a "Data Lake" and loading that same exact file into a Snowflake-managed data lake. None. In the US, pricing is $20-$23 per compressed TB on contract, depending on your cloud provider, and compression is excellent, so you get a reasonable storage price while still being able to query ANY of your data at speed.
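To make "query the files as-is" concrete, here is a minimal Snowpark-for-Python sketch. Everything in it is a placeholder of my own invention--the connection parameters, the @raw_events stage, and the JSON event shape--so treat it as an illustration, not a recipe:

```python
# Minimal sketch: querying staged JSON files with plain SQL via Snowpark.
# All names (connection params, @raw_events stage, event fields) are
# hypothetical placeholders -- substitute your own.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# A named file format so the stage query below is self-contained.
session.sql("CREATE FILE FORMAT IF NOT EXISTS json_fmt TYPE = JSON").collect()

# Query the raw files sitting on the stage directly: $1 is the parsed
# document, and :field / ::type drill into and cast semi-structured data.
rows = session.sql("""
    SELECT $1:event_type::STRING AS event_type,
           COUNT(*)              AS events
    FROM @raw_events (FILE_FORMAT => 'json_fmt')
    GROUP BY 1
""").collect()

# Or land the same files in a table (assumed here to have a single
# VARIANT column) and query them with identical SQL afterward.
session.sql("""
    COPY INTO events_raw
    FROM @raw_events
    FILE_FORMAT = (TYPE = JSON)
""").collect()
```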
Beyond that, even if you DIDN'T want to load the data into Snowflake for some reason, and you DID want to maintain a "two-tier" architecture, Snowflake offers a host of features (external tables, streams on external tables, materialized views on top of external tables, etc.) that provide usability and performance even in that case. Now that Snowflake can load unstructured data (the feature is in preview, but it has been announced), there aren't many data lake use cases that Snowflake can't handle in a world-class way.
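Here is a sketch of that external-table route. The stage, table, and column names are hypothetical, it assumes a Snowpark session built as in my earlier example, and materialized views do have restrictions on the SQL they support (simple aggregates like COUNT are fine):

```python
# Sketch: an external table over Parquet files that stay in your object
# store, plus a materialized view on top for performance. Names invented.
ddl_statements = [
    # External table: Snowflake metadata over files left in cloud storage.
    """
    CREATE EXTERNAL TABLE IF NOT EXISTS lake_events
      LOCATION = @lake_stage/events/
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = TRUE
    """,
    # Materialized view: Snowflake maintains the results, so queries
    # against it perform close to queries on natively loaded data.
    # External tables expose each file's record via the VALUE column.
    """
    CREATE MATERIALIZED VIEW IF NOT EXISTS lake_events_daily AS
    SELECT value:event_date::DATE AS event_date, COUNT(*) AS events
    FROM lake_events
    GROUP BY 1
    """,
]
for ddl in ddl_statements:
    session.sql(ddl).collect()
```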
Stephen, you're a sales engineer at Snowflake; you're just regurgitating Snowflake FUD. So much in that post is misleading that it's hard to know where to begin. The truth is that the vast majority of the data in the cloud sits in Data Lakes; if companies can make use of it right there, why move it out to the second tier called Snowflake? Especially since it'll be locked in and only queryable by SQL? Especially since there are zero machine learning capabilities there. Sure, they can access it using external tables, but then performance is bad and it's read-only. Snowflake is just a Data Warehouse... a very good one, almost as good as BigQuery (look at BQML, for example--there is no such thing in Snowflake). But it's not more than that.
What's FUD about what I said? Jamin stated that Snowflake is at a disadvantage because it has a "two-tier" architecture, and that just plain isn't correct. Just because there is a lot of data sitting in Hadoop or alternative data lake structures doesn't mean it's correct to keep it there anymore. Maybe 5 years ago, but not now. What's true is that Snowflake is sweeping up huge amounts of what I'd consider data lakes (migrations from Hadoop, or just data dumped in cloud object stores), in many cases because they weren't successful or were too expensive to maintain.
On the machine learning front, Snowflake has a ton of machine learning partners, and the best of them can read and write natively from Snowflake via SQL. No Spark required, but you can use Spark if you want. If your machine learning product of choice can't read and write SQL, or can't use Snowflake's native Python or Spark connectors, that sounds like a problem with the ML solution, not with Snowflake.
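As an illustration of the read-via-SQL point, here is a minimal sketch using the snowflake-connector-python package to pull a feature set into pandas and fit a scikit-learn model. The table, columns, and credentials are all made up:

```python
# Sketch: an ML tool reading Snowflake over plain SQL. Hypothetical names.
import snowflake.connector
from sklearn.linear_model import LogisticRegression

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT feature_1, feature_2, label FROM churn_features")
    # fetch_pandas_all() needs the pandas extras:
    #   pip install "snowflake-connector-python[pandas]"
    df = cur.fetch_pandas_all()
finally:
    conn.close()

# Snowflake returns unquoted identifiers in uppercase.
model = LogisticRegression().fit(df[["FEATURE_1", "FEATURE_2"]], df["LABEL"])
```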
AWS is doing great stuff with SageMaker, plus Dataiku, DataRobot, H2O.ai, Zepl, and more--so many leading AI/ML tools work great with native connections to Snowflake that there is just no reason to maintain a separate "data lake" silo apart from other analytical areas. The disadvantage of having a separate silo on a different technology far outweighs other considerations. I stand by my comments; they aren't controversial.
(By the way, regarding the performance of external tables: materialized views on top of external tables perform at essentially the same level as if you had loaded the data into Snowflake, if you really want to go down that route.)
Love the coverage on the data space! Curious what you meant by "I'd love to see Snowflake find a way to better separate / charge for cold storage," since Snowflake stores data on S3 and the cost is the same. The bulk of the cost ends up coming from compute, which is the key piece in scaling out large data infrastructures with many analytical transformations. Where I've seen storage become a little more expensive is when you choose to store your data in a data lake before moving it to Snowflake. You then double your storage cost with the redundancy, but S3 storage is relatively cheap. Either way, it is awesome to see where both companies are moving. It would be great to get to a point where both data science and BI/analytics can be powered by a single (ware/lake)house rather than needing one for each.
Databricks wins the ML war. Snowflake wins the "we used to have SQL Server, now we want something else" war.
Snowflake is for SQL companies; more advanced companies and use cases require Databricks.
I wouldn't be too sure about that. What I see is that 80% of a typical Databricks workload is really data engineering with Spark. Given that, you could say that Databricks is really a custom ETL company masquerading as an ML company. Now that Snowpark for Python is in GA, you'll begin to see case studies of customers converting PySpark data engineering workloads natively to Snowflake. I saw a case yesterday of a 20-hour Spark job taking less than 20 minutes in Snowpark. Same job, same data. Similarly, AMN Healthcare published a case study recently that showed 93% savings vs. Databricks. Overall, if Snowflake can run a PySpark job at 10% of the cost with better performance, it will be a very compelling move for customers.
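Part of why those conversions tend to be mechanical is that the Snowpark DataFrame API deliberately mirrors PySpark. A toy comparison--table and column names invented, Snowpark session built as in the earlier sketches:

```python
# PySpark version (runs on a Spark cluster):
#   from pyspark.sql import functions as F
#   result = (spark.table("orders")
#             .filter(F.col("status") == "shipped")
#             .groupBy("region")
#             .agg(F.sum("amount").alias("revenue")))

# Snowpark version (pushed down and executed inside Snowflake):
from snowflake.snowpark.functions import col, sum as sum_

result = (session.table("orders")
          .filter(col("status") == "shipped")
          .group_by("region")
          .agg(sum_("amount").alias("revenue")))
result.show()
```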
I would. Everyone is complaining about Snowflake costs, not Databricks. True or not, perception matters. Humans listen to other humans, and those humans are running around complaining about Snowflake costs.
I certainly hear Databricks trying to amplify that message, but most customers understand that a) Snowflake has gotten more cost-efficient over time (improved compression, faster warehouses that pass those savings through) and b) there are many native guardrails available (resource monitors, budgeting, consumption dashboards, etc.). By contrast, Databricks hides a lot of its costs in the cloud provider bill--there is no single place to see your Databricks spend. All that said, customers can test for themselves: if a customer tests their PySpark on Snowflake and sees that it costs 10% of the price of Databricks, runs faster, and needs much less management, they will move. If they see no improvement, they won't. The best technology will win.
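For anyone curious what those guardrails look like, here is an illustrative resource monitor. The name, quota, and warehouse are hypothetical, and you'd need ACCOUNTADMIN (or an appropriately granted role) to run it:

```python
# Sketch: a resource monitor that notifies at 75% of a monthly credit
# quota and suspends the warehouse at 100%. Names and quota invented.
guardrail_sql = [
    """
    CREATE RESOURCE MONITOR IF NOT EXISTS monthly_cap
      WITH CREDIT_QUOTA = 100
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 75 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND
    """,
    "ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_cap",
]
for stmt in guardrail_sql:
    session.sql(stmt).collect()
```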
Hi, sorry, I have a silly question to ask.
I have an economics background and know nothing about IT. I have no clue about the vast terminology in IT.
But I am interested in learning, since I have seen the compounded growth of these businesses (and their stocks) in recent decades.
Do you have a recommendation for where I should start learning?
Thanks for writing!
Jamin, great article--I liked the way you dissected this market! Thanks.
Very interesting, thank you for writing!
Might In-Memory Computing threaten Snowflake's ability to scale compute and storage separately? I'm more interested in the kind of IMC where compute sits close to memory on the same chip than in the software kind. I only know a little about it from a video by The Linley Group on YouTube. I don't have a background in tech, so 1) I hope the question makes sense, and 2) I appreciate the good free info from Clouded Judgement.
That proved to be a timely post, Jamin. Thank you. I would love to see an update of that post incorporating all the announcements from Snowflake's and Databricks' conferences 1-2 weeks ago.