World/PythonTricks - Concurrency and Parallelism

Posted on 2020-12-06

Child processes can run in parallel:

>>> import shlex, subprocess
>>> command_line = input()
/bin/vikings -input eggs.txt -output "spam spam.txt" -cmd "echo '$MONEY'"
>>> args = shlex.split(command_line)
>>> print(args)
['/bin/vikings', '-input', 'eggs.txt', '-output', 'spam spam.txt', '-cmd', "echo '$MONEY'"]
>>> p = subprocess.Popen(args) # Success!
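
As a rough illustration of the parallelism mentioned above, the sketch below starts several children with `subprocess.Popen` before waiting on any of them, so they run at the same time. The three sleep-and-print commands and the helper name `run_in_parallel` are assumptions made up for this example.

```python
import subprocess
import sys
import time


def run_in_parallel(commands):
    """Start every command first, then collect the output of each one."""
    # Popen returns immediately, so all children are running before we wait.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for cmd in commands
    ]
    # communicate() waits for one child to exit and returns (stdout, stderr).
    return [proc.communicate()[0].strip() for proc in procs]


# Hypothetical workload: three children that each sleep 0.5 s and print an id.
commands = [
    [sys.executable, "-c", f"import time; time.sleep(0.5); print({i})"]
    for i in range(3)
]

start = time.perf_counter()
outputs = run_in_parallel(commands)
elapsed = time.perf_counter() - start
print(outputs, f"{elapsed:.2f}s")
```

Because the children overlap, the total wall time should be close to one sleep rather than three, although interpreter startup adds some overhead.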

Systems/Streaming Systems

Posted on 2020-12-06

Big Data/get_json_object issue

Posted on 2020-12-06

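get_json_object uses off-heap memory, and by default the off-heap overhead is only max(executorMemory * 0.10, 384 MB). Consider raising it with

--conf "spark.yarn.executor.memoryOverhead=4G" (named spark.executor.memoryOverhead since Spark 2.3).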

https://blog.csdn.net/weixin_43267534/article/details/100978755
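
To see why the default can be too small, here is a minimal sketch of the default overhead arithmetic, assuming Spark's documented rule of 10% of executor memory with a 384 MiB floor. The function name and the sample executor sizes are made up for illustration.

```python
MIN_OVERHEAD_MIB = 384   # floor applied by the default rule (assumed from Spark docs)
OVERHEAD_FACTOR = 0.10   # fraction of executor memory used for overhead


def default_memory_overhead_mib(executor_memory_mib):
    """Return the default off-heap overhead, in MiB, for a given executor size."""
    return max(int(executor_memory_mib * OVERHEAD_FACTOR), MIN_OVERHEAD_MIB)


# Hypothetical executor sizes, in GiB.
for executor_gib in (2, 8, 16):
    overhead = default_memory_overhead_mib(executor_gib * 1024)
    print(f"{executor_gib} GiB executor -> {overhead} MiB overhead")
```

Even a 16 GiB executor gets well under 2 GiB of overhead by default, which is why an explicit setting such as 4G can help when a job leans on off-heap memory.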

Big Data/Spark-Streaming

Posted on 2020-12-06

Spark Streaming: micro-batch processing

Structured Streaming:

Big Data/Kafka

Posted on 2020-12-06

Getting to Know Kafka

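Kafka is now positioned as a distributed streaming platform, widely used for its high throughput, durable storage, horizontal scalability, and support for stream processing. A growing number of open-source distributed processing systems, such as Cloudera, Storm, Spark, and Flink, integrate with Kafka.

Messaging system: Like traditional messaging systems (also called message-oriented middleware), Kafka provides system decoupling, redundant storage, traffic peak shaving, buffering, asynchronous communication, scalability, and recoverability. In addition, Kafka offers message-ordering guarantees and replayable consumption, which most messaging systems struggle to provide.
Storage system: Kafka persists messages to disk, which effectively lowers the risk of data loss compared with systems that keep data in memory. Thanks to this persistence and its multi-replica mechanism, Kafka can serve as a long-term data store: set the retention policy to "forever" or enable log compaction on the topic.
Stream processing platform: Kafka not only provides a reliable data source for every popular stream processing framework, but also ships a complete stream processing library with operations such as windowing, joins, transformations, and aggregations.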
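
The ordering guarantee and replayable consumption described above both follow from an append-only log addressed by offset. The toy below is a hypothetical in-memory stand-in for a single partition, not Kafka's client API; the class and method names are invented for illustration.

```python
class ToyPartitionLog:
    """Hypothetical in-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record to the end of the log and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after `offset`, in append order."""
        return list(self._records[offset:])


log = ToyPartitionLog()
for event in ("created", "paid", "shipped"):
    log.append(event)

first_pass = log.read_from(0)  # a consumer reading from the beginning
replay = log.read_from(1)      # the same consumer rewinding to offset 1
print(first_pass, replay)
```

Reading never removes records, so a consumer can rewind to any earlier offset and see the same records in the same order.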

BUGS/BUGS

Posted on 2020-12-06

VS Code: C++11 support

Go to Settings > User Settings. There, search for Run Code Configuration:

Under this menu, find: "code-runner.executorMap"

Edit this setting by adding it to User Settings as below for C++11 support:

Tensorflow2-CustomMetrics

Posted on 2020-12-01 Edited on 2020-12-06 In Tensorflow

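Keras does not implement F1-score

Keras's earlier metrics were wrong because they were computed in a non-streaming way: values were calculated per batch and then averaged, rather than accumulated across the whole epoch.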
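
To make the non-streaming problem concrete, the plain-Python sketch below compares the mean of per-batch F1 scores with an F1 computed from counts accumulated over all batches. The two batches of labels and predictions are made-up data, and the helper names are hypothetical; this is not Keras code.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up (y_true, y_pred) pairs for two batches.
batches = [
    ([1, 1, 1, 0], [1, 1, 1, 0]),
    ([1, 0, 0, 0], [0, 1, 1, 0]),
]

# Non-streaming: compute F1 per batch, then average the batch scores.
per_batch = [f1_from_counts(*confusion_counts(t, p)) for t, p in batches]
batch_averaged = sum(per_batch) / len(per_batch)

# Streaming: accumulate the counts across batches, then compute F1 once.
total_tp = total_fp = total_fn = 0
for t, p in batches:
    tp, fp, fn = confusion_counts(t, p)
    total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
streaming = f1_from_counts(total_tp, total_fp, total_fn)

print(batch_averaged, streaming)
```

The two numbers disagree because F1 is not a linear function of the counts, so averaging batch scores is not the same as scoring the whole epoch.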

Metrics

Define MSE as a metric as follows. It is a scalar value, and during training or evaluation the result shown for each epoch is the average of the per-batch metric values for all batches seen during that epoch.

Tensorflow2-AutoGraph

Posted on 2020-10-01 Edited on 2021-03-17 In Tensorflow

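Three kinds of computation graphs

There are three ways to build a computation graph: static graphs, dynamic graphs, and AutoGraph.

In the TensorFlow 1.0 era, static graphs were used: you first build the graph with TensorFlow operators, then open a Session and explicitly execute the graph.

In the TensorFlow 2.0 era, dynamic graphs are used: each operator is added to an implicit default graph as soon as it is used and executed immediately to produce a result, with no need to open a Session.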

Tensorflow2-DistributedTraining

Posted on 2020-10-01 Edited on 2021-03-14 In Tensorflow

Distributed training

The code here is similar to the multi-GPU training tutorial with one key difference:

when using Estimator for multi-worker training, it is necessary to shard the dataset by the number of workers to ensure model convergence. Each worker is assigned its own slice of the dataset.
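
The sharding step can be pictured with a small plain-Python sketch. It is modeled on the round-robin selection that `tf.data.Dataset.shard(num_shards, index)` performs, but it works on an ordinary list; the `shard` helper, the ten-element dataset, and the worker count of three are all assumptions for illustration.

```python
def shard(dataset, num_workers, worker_index):
    """Keep every num_workers-th element, starting at worker_index."""
    return [x for i, x in enumerate(dataset) if i % num_workers == worker_index]


dataset = list(range(10))  # hypothetical dataset of ten examples
num_workers = 3            # hypothetical cluster size

shards = [shard(dataset, num_workers, w) for w in range(num_workers)]
for worker_index, worker_data in enumerate(shards):
    print(worker_index, worker_data)
```

Every example lands on exactly one worker, so no worker repeats another's data within a pass.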

Tensorflow2-CallBack

Posted on 2020-10-01 Edited on 2021-03-17 In Tensorflow

Keras callbacks overview

All callback classes inherit from keras.callbacks.Callback.

A list of callbacks can be passed to the following methods (via the callbacks argument):

keras.Model.fit()
keras.Model.evaluate()
keras.Model.predict()