Mon blog-notes à moi que j'ai

Personal blog of a sysadmin with a hacker bent

Twitter & RSS watch roundup #2016-32

This is the harvest of links for the week of August 8 to 12, 2016. Most of them were published on my Twitter account. Here they are, gathered for those who missed them.

Happy reading

Security & Privacy

On Cybersecurity and Being Targeted
Last month, I was the subject of a targeted cyber attack. Someone went to substantial lengths to attempt to gain access to my GitHub account, but was thankfully unsuccessful because two-factor authentication was enabled.
Keep cyber criminals at bay, use 2FA!
One of the easiest ways to better protect your online accounts is using something called 2FA, or 2 Factor Authentication. Don’t worry, it’s not difficult to set up or hard to use, but it will pretty much stop cyber criminals from being able to access your accounts!
Easily Improving Linux Security with Two-Factor Authentication
2-Factor Authentication (2FA) is a simple way to help improve the security of your systems. It restricts the scope of damage if a machine is compromised. If, for instance, you have a security token or authenticator app on your phone that is required for ssh to a remote machine, then even if every laptop you use to connect to the remote is totally owned, an attacker cannot establish a new ssh session on their own.
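To make the "authenticator app" case concrete, here is a minimal sketch of how a time-based one-time password can be computed. It assumes the app implements TOTP (RFC 6238) on top of HOTP (RFC 4226), which the excerpt does not state; the function names `hotp` and `totp` are illustrative, and a real deployment would rely on a vetted PAM module rather than this code.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    # The counter is encoded as an 8-byte big-endian integer.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the current time step."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)
```

The server and the phone share `secret` and both derive the same short-lived code from the clock, so a stolen password or SSH key alone is not enough to open a new session.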
Using PKCS#11 on GNU/Linux
PKCS#11 is a standard API to interface with HSMs, Smart Cards, or other types of random hardware-backed crypto. On my travel laptop, I use a few Yubikeys in PKCS#11 mode using OpenSC to handle system login. libpam-pkcs11 is a pretty easy to use module that will let you log into your system locally using a PKCS#11 token.

System Engineering

This is strictly a violation of the TCP specification
I was asked to debug another weird issue on our network. Apparently every now and then a connection going through CloudFlare would time out with a 522 HTTP error.
CloudFront Log Analysis Using the Logz.io ELK Stack
Content Delivery Networks (CDNs) play a crucial role in how the Web works today by allowing application developers to deliver content to end users with high levels of availability and performance.
MWoS 2015: Let’s Encrypt Automation Tooling
The Mozilla Winter of Security of 2015 has ended, and the participating teams of students are completing their projects.
The Certificate Automation tooling for Let’s Encrypt project wrapped up this month, having produced an experimental proof-of-concept patch for the Nginx webserver to tightly integrate the ACME automated certificate management protocol into the server operation.
The MWoS team, my co-mentor Richard Barnes, and I would like to thank Klaus Krapfenbauer, his advisor Martin Schmiedecker, and the Technical University of Vienna for all the excellent research and work on this project.
Below is Klaus’ end-of-project presentation on AirMozilla, as well as further details on the project.
Protecting Netflix Viewing Privacy at Scale
On the Open Connect team at Netflix, we are always working to enhance the hardware and software in the purpose-built Open Connect Appliances (OCAs) that store and serve Netflix video content. As we mentioned in a recent company blog post, since the beginning of the Open Connect program we have significantly increased the efficiency of our OCAs - from delivering 8 Gbps of throughput from a single server in 2012 to over 90 Gbps from a single server in 2016. We contribute to this effort on the software side by optimizing every aspect of the software for our unique use case - in particular, focusing on the open source FreeBSD operating system and the NGINX web server that run on the OCAs.

Software Engineering

Demystifying Continuous Integration, Delivery, and Deployment
For some, the practice of continuous integration (CI) and continuous delivery/deployment (CD) is part of daily life and comes as second nature. However, as I learned while attending a couple of conferences recently, there are still many who aren’t utilizing any form of automated testing, let alone CI/CD. Those not already practicing CI/CD expressed the desire to but either didn’t know where to start or lacked the support of their employers to invest time in it.
Automated testing on devices
As part of the Netflix SDK team, our responsibility is to ensure the new release version of the Netflix application is thoroughly tested to its highest operational quality before deploying onto gaming consoles and distributing as an SDK (along with a reference application) to Netflix device partners; eventually making its way to millions of smart TVs and set top boxes (STBs). Overall, our testing is responsible for the quality of Netflix running on millions of gaming consoles and internet connected TVs/STBs.
Fast and Accurate Document Detection for Scanning
A few weeks ago, Dropbox launched a set of new productivity tools including document scanning on iOS. This new feature allows users to scan documents with their smartphone camera and store those scans directly in their Dropbox. The feature automatically detects the document in the frame, extracts it from the background, fits it to a rectangular shape, removes shadows and adjusts the contrast, and finally saves it to a PDF file. For Dropbox Business users, we also run Optical Character Recognition (OCR) to recognize the text in the document for search and copy-pasting.
git-push-all
I maintain Debian packages for several projects which are hosted on GitHub. I have a master packaging branch containing both upstream’s code, and my debian/ subdirectory containing the packaging control files. When upstream makes a new release, I simply merge their release tag into master: git merge 1.2.3 (after reviewing the diff!).

Database Engineering

Aggregation features, Elasticsearch vs. MySQL (vs. MongoDB)
To make the MySQL Document Store primary programming interface, the X DevAPI, a success, we should provide building blocks to solve common web development tasks, for example, faceted search. But is there any chance a relational database bent towards a document store can compete around aggregation (grouping) features? Comparing Elasticsearch, MySQL and MongoDB gives mixed results. GROUP BY on JSON gets you pretty far but you need to watch out for pitfalls…
Use the index, JSON! Aggregation performance (Elastic, MySQL, MongoDB)
Use the index, Luke!? No: use the index, JSON! When hunting for features the X DevAPI should consider, I got shocked by some JSON query execution times. Sometimes the MySQL Document Store runs head-to-head with MongoDB, sometimes not. Sometimes the difference to Elasticsearch is huge, sometimes (just) acceptable. Every time, the index rules. Use the index, MySQL Document Store users!
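Faceted search, as mentioned in these two posts, boils down to counting documents per distinct value of a field, which is what GROUP BY does on the SQL side. The sketch below shows that idea on in-memory JSON-like documents; it does not use the X DevAPI, Elasticsearch, or MongoDB, and the function name `facet_counts` and the sample fields are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def facet_counts(documents: Iterable[Dict[str, Any]], field: str) -> Counter:
    """Count documents per distinct value of `field`, skipping documents without it."""
    return Counter(doc[field] for doc in documents if field in doc)


# Hypothetical product documents, as a document store might hold them.
products = [
    {"brand": "acme", "color": "red"},
    {"brand": "acme"},
    {"brand": "zenith", "color": "red"},
]
brand_facet = facet_counts(products, "brand")
```

A database answers the same question without scanning every document only when an index covers the grouped field, which is the point both posts insist on.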

Elasticsearch

Elasticsearch: Verifying Data Integrity with External Data Stores
Elasticsearch is sometimes used alongside other databases. In those scenarios, it’s often hard to implement solutions surrounding two-phase commits due to the lack of transaction support across all systems in play. Depending on your use case, it may or may not be necessary to verify the data exists in both data stores, where one serves as the so-called "source of truth" for the other.
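One simple way to picture such a verification is to compare the document identifiers held by each store. The sketch below is not taken from the linked article: it assumes both ID lists have already been fetched (for example by a query on the source database and a scan of the index), and the name `compare_stores` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def compare_stores(source_ids: Iterable[Hashable], index_ids: Iterable[Hashable]) -> Dict[str, List]:
    """Report IDs present in only one of the two stores.

    `source_ids` comes from the source of truth, `index_ids` from the search index.
    """
    source, index = set(source_ids), set(index_ids)
    return {
        # In the source of truth but never indexed (or lost from the index).
        "missing_from_index": sorted(source - index),
        # Still in the index although the source of truth no longer has them.
        "orphaned_in_index": sorted(index - source),
    }
```

For large data sets the same comparison would usually be done in batches or on sorted streams rather than by loading every ID in memory.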

MySQL & MariaDB

Tracker: Ingesting MySQL data at scale - Part 1
At Pinterest we’re building the world’s most comprehensive discovery engine, and part of achieving a highly personalized, relevant and fast service is running thousands of jobs on our Hadoop/Spark cluster. To feed the data for computation, we need to ingest a large volume of raw data from online data sources such as MySQL, Kafka and Redis. We’ve previously covered our logging pipeline and moving Kafka data onto S3. Here we’ll share lessons learned in moving data at scale from MySQL to S3, and our journey in implementing Tracker, a database ingestion system to move content at massive scale.
Small innodb_page_size as a performance boost for SSD
In this blog post, we’ll discuss how a small innodb_page_size can create a performance boost for SSD.
In my previous post Testing Samsung storage in tpcc-mysql benchmark of Percona Server I compared different Samsung devices. Most solid state drives (SSDs) use 4KiB as an internal page size, and the InnoDB default page size is 16KiB. I wondered how using a different innodb_page_size might affect the overall performance.
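The mismatch described here is simple arithmetic: a 16KiB InnoDB page spans four 4KiB SSD pages, so changing one InnoDB page touches several device pages. A minimal sketch of that calculation follows; the function name is hypothetical and the 4KiB device page size is the figure quoted in the excerpt, not a property of every SSD.

```python
def ssd_pages_per_innodb_page(innodb_page_kib: int, ssd_page_kib: int = 4) -> int:
    """Number of SSD pages needed to hold one InnoDB page (rounded up)."""
    if innodb_page_kib <= 0 or ssd_page_kib <= 0:
        raise ValueError("page sizes must be positive")
    # Ceiling division without floating point.
    return -(-innodb_page_kib // ssd_page_kib)
```

With the default `innodb_page_size` of 16KiB this gives 4, and with a 4KiB InnoDB page it gives 1, which is the configuration the post sets out to benchmark.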
Tuning MySQL Group Replication for fun… and profit!
Group Replication introduces a new way to do replication in MySQL. With great features such as multi-master replication it brings a range of exciting deployment scenarios where some difficult problems become much easier to solve. Please check a few recent blog posts about group replication here, here, and here.
Group Replication brings a new set of options that may need to be configured to extract the highest performance from the underlying computing resources. In this blog post we will explain the components that specifically affect the performance of Group Replication and how they can be optimized.

Cassandra

How We Scaled Our Ad Analytics with Cassandra
On the Ad Backend team, we recently moved our ad analytics data from MySQL to Cassandra. Here’s why we thought Cassandra was a good fit for our application, and some lessons we learned that you might find useful if you’re thinking about using Cassandra!

Vertica

What’s New in 7.2.3: Apache Parquet Reader
Vertica 7.2.3 introduces the ability to natively read Apache Hadoop files stored in Parquet format. This feature coupled with the other Hadoop integration new features (covered in this previous blog post) let you:

Data Engineering & Analytics

More Than Just a Schema Store
This is the third post in a series covering Yelp’s real-time streaming data infrastructure. Our series explores in-depth how we stream MySQL updates in real-time with an exactly-once guarantee, how we automatically track & migrate schemas, how we process and transform streams, and finally how we connect all of this into datastores like Redshift and Salesforce.
Supervised Learning - Comprehensive Tutorial (Python-based)
This article is from scikit-learn. Scikit-learn (Machine Learning in Python) offers simple and efficient tools for data mining and data analysis. It is accessible to everybody and reusable in various contexts, built on NumPy, SciPy, and matplotlib, and open source and commercially usable under the BSD license.

Network Engineering

Beacon termination at the edge
In the past, I’ve given a number of talks about dealing with uncacheable content and using CDNs to extend the application to the edge by treating them as part of your stack. In these talks, I use real-world examples to discuss some of these topics. Using a CDN for beacon termination at the edge is the one example that has gotten the most attention and generated the most questions. This is partially because edge termination for beacons is cool; but it’s mostly because beaconing applications are becoming more popular and they involve a lot of components, deployed at scale, to accommodate proper data collection. These applications are primarily used for analytics, monitoring, and targeting, all of which have become vital elements in today’s modern business operations, making their deployment critical. Using a CDN with edge caches and the right combinations of features can help application deployment and the collection of crucial data, so I thought it’d be a good idea to finally write about it.
IPv6 Inside LinkedIn Part II
With the exception of link-local addresses and the now-deprecated site-local addresses, all IPv6 addresses are globally routable. In IPv4 RFC1918 space, a certain level of comfort was felt knowing that these private networks should not be able to "leak" to the global internet. With IPv6, however, this isn’t guaranteed, as all global IPv6 space is routable globally. Data center designs now need to implement more robust security policies, as traffic can originate both internally and externally. It’s no longer feasible to have a simple policy that only permits or denies RFC1918 addresses.
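The address categories named in this excerpt can be told apart with Python's standard `ipaddress` module. The sketch below is only an illustration of the classification, not LinkedIn's policy: the function name `classify` and its labels are hypothetical, and every IPv6 address that is neither link-local nor site-local is lumped into one bucket, following the excerpt's simplification.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def classify(address: str) -> str:
    """Return a coarse label for an IPv4 or IPv6 address."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        in_rfc1918 = any(ip in net for net in RFC1918_NETWORKS)
        return "ipv4-rfc1918" if in_rfc1918 else "ipv4-other"
    if ip.is_link_local:   # fe80::/10
        return "ipv6-link-local"
    if ip.is_site_local:   # fec0::/10, deprecated
        return "ipv6-site-local"
    # Everything else is treated as potentially routable and needs an explicit policy.
    return "ipv6-other"
```

A filter built on the first branch alone is the "simple policy" the excerpt warns about: it says nothing useful about traffic that falls into `ipv6-other`.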