Can an Iceberg table be simultaneously listed in Hive and Dremio (Nessie)? -- Use case: Amazon Athena and AWS Glue don't know about Nessie; they use Hive. However, I have Dremio in AWS and want the data-as-code experience. How do we make that happen? Answer: Technically, you can have an Iceberg table registered in two different catalogs, such as Hive and Nessie. But this isn't recommended and isn't safe, because you would be storing the current state of the table (the metadata pointer) in two different places. If the table is updated through both catalogs, different readers of the same table can see different data.
For the second part of the question, where you would like to use Nessie as a catalog for a data-as-code approach: Athena and Glue (the ETL aspect) don't support Nessie yet, so you would not be able to use this combination safely.
However, if you specifically want to leverage data-as-code, you can currently run Spark on Glue, and Spark supports Nessie.
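As a sketch of what that setup looks like, a Spark job can be pointed at Nessie with catalog properties along these lines. The endpoint URI, branch name (`ref`), and warehouse path below are placeholders, not defaults:

```python
# Sketch of Spark catalog properties for Iceberg with Nessie; the URI,
# branch ("ref"), and warehouse location are placeholders -- substitute
# your own Nessie endpoint and bucket.
nessie_catalog_conf = {
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    "spark.sql.catalog.nessie.uri": "http://nessie-host:19120/api/v1",
    "spark.sql.catalog.nessie.ref": "main",
    "spark.sql.catalog.nessie.warehouse": "s3a://example-bucket/warehouse",
}

def to_spark_submit_args(conf):
    """Render the properties as --conf flags for spark-submit."""
    return [f"--conf {key}={value}" for key, value in conf.items()]
```

The same properties can be passed via `SparkSession.builder.config(...)` instead of `spark-submit` flags.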
Does Nessie support custom Iceberg catalogs? Answer: Nessie itself is a catalog (a custom catalog implementation) to be used with Apache Iceberg.
Is the Nessie catalog supported in the dremio-oss Docker container? It doesn't seem to work. Answer: As of now, our understanding is that it doesn't work. It is on our roadmap, and hopefully we will have it in the future.
Does Nessie support consistent multi-table commits? Answer: Yes, Nessie supports multi-table commits. You can take advantage of features such as isolated branching, git-like commits, and data versioning to update multiple tables and keep them consistent.
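As an illustrative sketch of that pattern, here is a sequence of statements in the style of Nessie's Spark SQL extensions. The `nessie` catalog name, the `etl` branch, and the table names are all made up for the example:

```python
# Illustrative Nessie Spark SQL extension statements: write to two tables on
# an isolated branch, then merge so both changes become visible together.
# Branch, catalog, and table names are assumptions for this sketch.
multi_table_commit = [
    "CREATE BRANCH IF NOT EXISTS etl IN nessie",   # isolate the work
    "USE REFERENCE etl IN nessie",                 # direct writes to the branch
    "INSERT INTO nessie.db.orders SELECT * FROM staging_orders",
    "INSERT INTO nessie.db.order_items SELECT * FROM staging_items",
    "MERGE BRANCH etl INTO main IN nessie",        # both tables land atomically
]
```

Readers on `main` see either both inserts or neither, which is the multi-table consistency guarantee in question.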
I thought Iceberg kept track of all commits inside itself, not in the catalogs. Answer: That is correct. Iceberg keeps track of all commits itself, in its metadata files. However, the catalog is a crucial part of that: it tracks the current state of the table. For example, say you write to a table and the current state is v3.metadata.json, but the Glue catalog still points to v2.metadata.json while Nessie points to v3 (depending on the time difference between writes). That split-second difference can have a huge impact on the read side. On the write side, there can be additional consistency issues: if someone creates a v3.metadata.json from the Glue side, the linear history won't be maintained.
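To make the pointer idea concrete, here is a toy Python model (not Iceberg's actual code) of a catalog whose whole job is to move one metadata pointer via compare-and-swap. With two independent catalogs, this check cannot protect you, because each catalog only sees its own pointer:

```python
class ToyCatalog:
    """Toy model of an Iceberg catalog: it holds one metadata pointer that is
    only ever moved by compare-and-swap. Illustrative only, not real Iceberg."""

    def __init__(self, pointer):
        self.pointer = pointer

    def commit(self, expected, new):
        # The commit succeeds only if no other writer has moved the pointer
        # since we read it; otherwise the writer must retry from the latest state.
        if self.pointer != expected:
            raise RuntimeError("concurrent commit detected; retry from latest")
        self.pointer = new

catalog = ToyCatalog("v2.metadata.json")
catalog.commit("v2.metadata.json", "v3.metadata.json")  # succeeds
# A second writer still holding the stale v2 pointer now gets an error instead
# of silently forking history -- two separate catalogs cannot coordinate this.
```

This is why a single catalog is the source of truth: the compare-and-swap is only meaningful if everyone commits through the same pointer.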
Can Iceberg tables be stored on a NAS? Answer: Technically, yes; any file system works (local, HDFS, object stores). The most important piece is the catalog, which provides the atomicity guarantees.
Can we use one catalog for writing (say AWS Glue) and another for reading (say Snowflake)? Answer: As of now, we don't think there is an open-source catalog on the Snowflake side (theirs is internal). And again, storing the same Iceberg table in two different catalogs is not safe and can result in inconsistencies, i.e. commits may not be reflected, so Snowflake could still show old data. Also, as far as we can tell from the docs, Snowflake doesn't currently support any catalog other than its internal one.
What is the right combination of catalog and type? Right now RestCatalog with type hadoop breaks with AWS S3, and I cannot load Spark-generated Iceberg tables in Dremio. Answer: Dremio doesn't support REST catalogs yet, so tables created in a REST catalog won't be accessible in Dremio. The other consideration is that you can write Iceberg tables with the Hadoop catalog on S3 using Spark and use them in Dremio (by promoting them), but that isn't a safe operation: S3 doesn't support atomic swaps, which is why you need a catalog. A common option is Glue.
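As a hedged sketch of the Glue option, a Glue-backed Iceberg catalog in Spark is configured with properties along these lines; the warehouse bucket name is a placeholder:

```python
# Sketch of Spark properties for an Iceberg catalog backed by AWS Glue, which
# supplies the atomic metadata-pointer swap that plain S3 lacks.
# The warehouse bucket name is a placeholder.
glue_catalog_conf = {
    "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue.warehouse": "s3://example-bucket/warehouse",
}
```

Tables written this way carry their current-pointer state in Glue, so Dremio (via its Glue source) and Spark see a consistent view.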
Snowflake's Iceberg tables don't support partitions, so I am forced to use some other catalog where the Iceberg .parquet files are partitioned. But because my EDW is in Snowflake, I have to have the Iceberg catalog in Snowflake. So my workflow is to use the Hive metastore and copy the files over to S3 (it's ugly, but OK for a PoC)? Answer: We think this makes sense for now, given the limitations on the Snowflake side (with catalog support). We are also fairly sure they will work toward supporting major catalogs like Glue or Hive. For a PoC this should be alright.
Are there any benchmarks I can look at for update and read performance with Iceberg, whether using Spark, Athena, Dremio, etc. as the compute engine? Curious how the performance compares between a data warehouse (Snowflake) and a data lakehouse (Dremio)? Answer: As of now we don't have one, but we are currently looking at this. One thing to add: engine performance comparisons really depend on the type of workload. Spark was built as a replacement for MapReduce and works great for long-running jobs (with features like resiliency), but it isn't great for interactive queries. Dremio, by contrast, shines on interactive queries with Iceberg. So it depends on the use case, too.
For performance comparisons between the two architectures, we are working on that right now, so we should have something soon.
For multi-table transaction consistency, must the Nessie catalog be used? Answer: That is correct. As of now, Nessie is the only catalog that supports multi-table transactions.
Our team is currently using Delta Lake but is considering switching to Iceberg. How frequently do you see customers make that move (and why), and how long does it take on average (obviously dependent on a few different variables)? Answer: We have seen a lot of interest in this migration among our customers and prospects (on the Dremio side), and a couple may already have done it. There are a few reasons people want to move. One of the main ones is that they see Delta Lake as another form of lock-in for their data, much like vendors such as Snowflake. There has been talk about open-sourcing all of Delta's capabilities, but a lot of things still aren't available outside the commercial offering.
On top of that, customers have seen the huge adoption of Iceberg, the diversity of its community, its compatibility with multiple engines, and the openness of the data, so that's a big driving factor. On the feature side, there are a couple of capabilities not in Delta Lake, such as merge-on-read (MOR) and hidden partitioning (where you don't need to create and track additional columns). Delta addresses partitioning in some ways (by creating extra columns), but there are significant advantages to Iceberg's hidden partitioning.
I get this error: IllegalArgumentException: Cannot create catalog demo, both type and catalog-impl are set: type=hadoop, catalog-impl=org.apache.iceberg.rest.RESTCatalog. Answer: A RESTCatalog cannot be of type hadoop. When you configure a REST catalog, there is no need to specify the type. More info here: https://tabular.io/blog/rest-catalog-docker/
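A minimal sketch of the fix, modeling the catalog properties as a plain Python dict (the REST endpoint URI is a placeholder): drop the conflicting `type` key so that only `catalog-impl` identifies the catalog implementation.

```python
# The failing configuration declares both "type" and "catalog-impl", which
# Iceberg rejects as ambiguous. Removing "type" resolves the conflict.
# The REST endpoint URI below is a placeholder.
broken_conf = {
    "type": "hadoop",
    "catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
    "uri": "http://rest-catalog-host:8181",
}
fixed_conf = {k: v for k, v in broken_conf.items() if k != "type"}
```

The same applies in reverse: keeping `type` and dropping `catalog-impl` would also avoid the error, but `type=hadoop` is the wrong type for a REST catalog.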
PS: Dremio doesn’t work with the REST catalog yet. The above example is with Spark.
I gather the only realistic way to load Dremio using the Nessie catalog is with Spark + support libraries. How do people then use the Iceberg tables with other tools? It sounds like they would need to constantly tell their external tools to recheck the Iceberg tables for the latest commit. Is that right? Answer: There are a couple of aspects here. If you are specifically looking at using Dremio with Iceberg tables and using Iceberg tables with other tools, the Hive and Glue catalogs are the most widely supported. Nessie support in other tools is also being worked on; Trino and Presto already support it. But as of now, if you want to leverage Nessie (or Dremio Arctic, the service), Spark and Dremio are the main options, and we should see more support coming as the need arises. On the streaming side, Apache Flink also works with Nessie.
Glue is AWS; we want only on-prem stuff. Answer: Currently, we have the Hive catalog (a Hive metastore on prem) that other engines can connect to. Nessie can also run on-prem with a number of backends such as Postgres, Mongo, and Dynamo. Read more on Nessie here. You can then use engines such as Spark. Additionally, the Hadoop catalog can be used on HDFS, but it needs some kind of locking mechanism to achieve the atomicity guarantees.
Do you see a performance impact between Delta Lake, Apache Hudi, and Iceberg on S3? Answer: They should be roughly the same, but it can depend on a few factors, mostly use-case based. For example, Iceberg has merge-on-read whereas Delta doesn't, so if you have a lot of streaming updates and use features such as equality deletes, Iceberg will be very fast on the write side (with read-side tradeoffs). So it depends on your specific use case; with Iceberg, we have that option with MOR. Another thing to consider is request throttling with object stores in general, including S3: S3 and other cloud storage services throttle requests based on object prefix, so you need to make sure your files are spread across different prefixes. Iceberg provides an out-of-the-box capability to generate a deterministic hash for each stored file, so files written to S3 are distributed evenly across multiple prefixes.
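That hash-prefixed layout is opt-in per table via a table property. As a sketch (the table name is a placeholder; `write.object-storage.enabled` is the Iceberg property we understand controls this behavior):

```python
# Sketch: enabling Iceberg's object-store layout per table, so data file
# paths get a deterministic hash prefix and S3 requests spread across
# prefixes instead of hammering one. The table name is a placeholder.
object_store_props = {"write.object-storage.enabled": "true"}

alter_stmt = (
    "ALTER TABLE demo.db.events SET TBLPROPERTIES ("
    + ", ".join(f"'{k}'='{v}'" for k, v in object_store_props.items())
    + ")"
)
```

The resulting statement can be run from any engine that supports `ALTER TABLE ... SET TBLPROPERTIES` on Iceberg tables, such as Spark.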
Do you know if the Iceberg Java APIs are comprehensive enough to read and write data rows to tables that could later be added to a catalog? Answer: Yes, but we need to make sure the Java API can work with the specific catalog. We can then create a Catalog instance and load tables to do further reads and writes. Note that this is not independent of the catalog, so it depends on which one you use.
What is the preferred CDC streaming process for capturing RDS PostgreSQL table data into S3 Iceberg tables: the Flink CDC connector, Debezium, or Kafka/Kinesis? Answer: It largely depends on the tools at your disposal and the use case. If you already have Debezium, that should be good. If you want to use Kafka, there is a Kafka Connect connector, and the same goes for Flink.
Thank you! Will wait for Nessie + Demio for MinIO Deployment. Answer: Yes.
What are the minimum requirements for adding support for reading, updating, and writing to a Java-based ETL tool? Specifically, I am looking to add support for Iceberg tables to Apache Hop. Answer: The Iceberg Java API can be used for reading (there is a scan-planning API): essentially, you ask which data files need to be scanned and you get back a list of tasks for the files to scan. There are also classes for reading Parquet files. On the write side, you figure out what changes you want to make to the underlying Parquet files and save them as a flat Parquet table. Then, depending on the modifications and the write strategy (COW or MOR), the Iceberg write-side APIs can be used to either append new data or overwrite existing data. Iceberg internally takes care of managing the metadata files and pointing the metadata pointer to the latest one. Read more on the API here.
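To make the read side concrete, here is a toy Python model of scan planning. This is not the real Java API (whose classes live under org.apache.iceberg); it only illustrates the shape of the interaction: the engine asks for the data files matching a filter and gets back a list of scan tasks.

```python
from dataclasses import dataclass, field

# Toy model of Iceberg scan planning, NOT the real Java API: the caller asks
# which data files match a partition filter and receives one task per file.

@dataclass
class DataFile:
    path: str
    partition: dict = field(default_factory=dict)

@dataclass
class ScanTask:
    file: DataFile

def plan_scan(data_files, partition_filter):
    """Return a scan task for every file whose partition matches the filter."""
    return [
        ScanTask(f)
        for f in data_files
        if all(f.partition.get(k) == v for k, v in partition_filter.items())
    ]

files = [
    DataFile("s3://bucket/t/day=2023-01-01/a.parquet", {"day": "2023-01-01"}),
    DataFile("s3://bucket/t/day=2023-01-02/b.parquet", {"day": "2023-01-02"}),
]
tasks = plan_scan(files, {"day": "2023-01-02"})  # one task, for b.parquet
```

An ETL tool would execute each returned task (read the Parquet file, apply residual filters), which mirrors how engines consume Iceberg's scan-planning results.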