Reddit Reddit reviews Database Internals: A Deep Dive into How Distributed Data Systems Work

We found 1 Reddit comments about Database Internals: A Deep Dive into How Distributed Data Systems Work. Here are the top ones, ranked by their Reddit score.

Computers & Technology
Books
Business Technology
Management Information Systems
Database Internals: A Deep Dive into How Distributed Data Systems Work
Check price on Amazon

1 Reddit comment about Database Internals: A Deep Dive into How Distributed Data Systems Work:

u/parts_of_speech ยท 12 pointsr/datascience

Hey, DE here with lots of experience, and I was self taught. I can be pretty specific about the subfield and what is necessary to know and not know. In an inversion of the normal path I did a mid career M.Sc in CS so it was kind of amusing to see what was and was not relevant in traditional CS. Prestigious C.S. programs prepare you for an academic career in C.S. theory but the down and dirty of moving and processing data use only a specific subset. You can also get a lot done without the theory for a while.

If I had to transition now, I'd look into a bootcamp program like Insight Data Engineering. At least look at their syllabus. In terms of CS fundamentals... https://teachyourselfcs.com/ offers a list of resources you can use over the years to fill in the blanks. They put you in front of employers, force you to finish a demo project.

Data Engineering is more fundamentally operational in nature that most software engineering You care a lot about things happening reliably across multiple systems, and when using many systems the fragility increases a lot. A typical pipeline can cross a hundred actual computers and 3 or 4 different frameworks.doesn't need a lot of it. (Also I'm doing the inverse transition as you... trying to understand multivariate time series right now)

I have trained jr coders to be come data engineers and I focus a lot on Operating System fundamentals: network, memory, processes. Debugging systems is a different skill set than debugging code, it's often much more I/O centric. It's very useful to be quick on the command line too as you are often shelling in to diagnose what's happening on this computer or that. Checking 'top', 'netstat', grepping through logs. Distributed systems are a pain. Data Eng in production is like 1/4 linux sysadmin.

It's good to be a language polyglot. (python, bash commands, SQL, Java)

Those massive java stack traces are less intimidating when you know that Java's design encourages lots of deep class hierarchies, and every library you import introduces a few layers to the stack trace. But usually the meat and potatoes method you need to look at is at the top of a given thread. Scala is only useful because of Spark, and the level of Scala you need to know for Spark is small compared to the full extent of the language. Mostly you are programatically configuring a computation graph.

Kleppman's book is a great way to skip to relevant things in large system design.

https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321

It's very worth understanding how relational databases work because all the big distributed systems are basically subsets of relational database functionality, compromised for the sake of the distributed-ness. The fundamental concepts of how the data is partitioned, written to disk, caching, indexing, query optimization and transaction handling all apply. Whether the input is SQL or Spark, you are usually generate the same few fundamental operations (google Relational Algebra) and asking the system to execute it the best way it knows how. We face the same data issues now we did in the 70s but at a larger scale.

Keeping up with the framework or storage product fashion show is a lot easier when you have these fundamentals. I used Ramakrishnan, Database Management Systems. But anything that puts you in the position of asking how database systems work from the inside is extremely relevant even for "big data" distributed systems.

https://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638

I also saw this recently and by the ToC it covers lots of stuff.

https://www.amazon.com/Database-Internals-Deep-Distributed-Systems-ebook/dp/B07XW76VHZ/ref=sr_1_1?keywords=database+internals&qid=1568739274&s=gateway&sr=8-1

But to keep in mind... the designers of these big data systems all had a thorough grounding in the issues of single node relational databases systems. It's very clarifying to see things through that lens.